[Agent-Upgrade]: For Linux .tar deploy; Agent goes Unhealthy on upgrade with Endpoint Security #148

Closed
amolnater-qasource opened this issue Sep 6, 2021 · 69 comments
Assignees
Labels
8.1-candidate bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team Team:Fleet Label for the Fleet team v8.1.0

Comments

@amolnater-qasource

amolnater-qasource commented Sep 6, 2021

Kibana version: 7.15.0 Snapshot Kibana Cloud environment

Host OS and Browser version: VSphere Ubuntu and MAC, All

Build details:

Build: 43937
Commit: d4c2d0476c622ba2314ab35c3439f1fad4dc0b34
Artifact Link: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-7.14.1-linux-x86_64.tar.gz

Preconditions:

  1. 7.15.0 Snapshot Kibana Cloud environment should be available.
  2. 7.14.1 released Agent must be installed with Default policy having System and Endpoint Security integration.

Steps to reproduce:

  1. Login to Kibana environment.
  2. Trigger Agent upgrade from Fleet UI for 7.14.1 release agent.
  3. Observe that the agent goes Unhealthy after the upgrade.

Debug level Logs:
logs.zip
endpoint-000000.zip

Note:

  • This issue is observed on vSphere machines: linux qa-ubuntu20.04-desktop and mac qa-mac-bigsur-11.0.1-release-nosip-clone-base
  • This issue is not observed on AWS-Ubuntu 20

Expected Result:
7.14.1 Ubuntu .tar agent should upgrade to 7.15.0 with Endpoint Security and should remain Healthy.

Screenshots:
5
6

@amolnater-qasource amolnater-qasource added bug Something isn't working impact:high Short-term priority; add to current release, or definitely next. Team:Elastic-Agent Label for the Agent team Team:Fleet Label for the Fleet team labels Sep 6, 2021
@elasticmachine
Collaborator

Pinging @elastic/fleet (Team:Fleet)

@amolnater-qasource
Author

@manishgupta-qasource Please review.

@manishgupta-qasource

Reviewed & assigned to @blakerouse

CC: @EricDavisX

@EricDavisX
Contributor

We had other recent upgrade tests pass successfully, so I believe this will be specific to either Ubuntu (unlikely?) or to something in use or set up on the VM.

@amolnater-qasource it would be helpful to confirm whether a straight install of 7.15-snapshot works on that same Ubuntu 20 image. If it does not, then this isn't necessarily an upgrade bug, which helps reduce triage effort. Thank you for testing on AWS-based Ubuntu; that is very helpful already.

@blakerouse
Contributor

Yeah, I wonder if this is really an upgrade issue or more of an issue with Endpoint? It seems that the upgrade worked but Endpoint is having an issue.

@blakerouse blakerouse removed their assignment Sep 7, 2021
@amolnater-qasource
Author

Hi @EricDavisX
As per feedback we have installed 7.15.0 Snapshot agent on the same Ubuntu 20 machine and observed no issues.
Further we have also attempted a reboot, however got no errors.

We have also revalidated the Linux agent upgrade issue. However, today we are not able to upgrade the Linux .tar agent at all, and the agent goes Unhealthy.
This was tested with and without Endpoint Security, and the agent upgrade did not succeed in either case.

Screenshot:
7

We are successfully able to upgrade Windows and MAC with Endpoint security and getting no Unhealthy status.

These upgrade issues on Linux seem to be caused by a vSphere issue, since on AWS Ubuntu 20 we are able to upgrade agents successfully with no errors.
8

Build details:
Build: 43957
Commit: 0239ff6864dd9930cfe9bcd9a679272f2b7465c2
Artifact Link: https://www.elastic.co/downloads/past-releases/elastic-agent-7-14-1

cc: @blakerouse
Thanks

@blakerouse
Contributor

Glad to hear it was just an environment issue.

@amolnater-qasource
Author

Further, we will re-test this issue once the vSphere Linux machines no longer show errors, and will share the test results here.

Thanks

@amolnater-qasource
Author

amolnater-qasource commented Sep 13, 2021

Hi @blakerouse
We have re-attempted to upgrade linux .tar agent on Vsphere Linux Ubuntu 20 (2 different VMs with no errors) and 1 Centos 8 machine. We have found this issue reproducible on all 3.

  • We are unable to upgrade Linux .tar agent from these Vsphere machines.

3

Build details:
Build: 44006
Commit: 8e250b3e431fe51eea966a0722a691ce70052225

This issue is not reproducible on AWS machines.
Hence we are re-opening this issue.

Thanks

@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated this issue with shipped 7.14.2 linux .tar agent upgrade on 7.15.0 shipped build and found this issue still reproducible.

  • Unable to upgrade Linux (.tar) 7.14.2 agent to 7.15.0 agent.

Build details:
Build: 44040
Commit: add5d2c5ebeba1d8bcf6a79f8863cd78760e1b3e
Artifact Link: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-7.14.2-linux-x86_64.tar.gz

Screenshot:
13

Thanks

@amolnater-qasource
Author

Hi @EricDavisX
We have attempted to Upgrade 7.15.0 released linux .tar agent on 7.16.0 Snapshot and found similar issue there.

  • Agent didn't upgrade to 7.16.0.
  • Agent went Unhealthy.
  • However after approximately 15 minutes, unknowingly agent got upgraded to 7.16.0 Snapshot.

Build details:
BUILD: 44811
COMMIT: 53fa6f53a005b244e70b5a696a8742d2fd1706f0
ARTIFACT LINK: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-7.15.0-linux-x86_64.tar.gz

Logs:
endpoint-000000.zip
logs.zip

Please let us know if anything else is required.
cc: @blakerouse
Thanks

@EricDavisX
Contributor

However after approximately 15 minutes, unknowingly agent got upgraded to 7.16.0 Snapshot.

Thanks for the continued follow up.

This comment is interesting to unpack. I wonder if our process is getting hung and timing out, but needlessly, and then continuing on successfully after the fact. @michalpristas do you have any thoughts?

@andresrc I'm curious if we should put this on our 'urgent review' list to spend time on it, or if we want to accept the risk, given that other Ubuntu 20 hosts work OK and only the ones in the team vSphere cluster seem to be configured with something that prevents (or delays) the success.

@andresrc andresrc added Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team and removed Team:Elastic-Agent Label for the Agent team labels Oct 6, 2021
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@blakerouse blakerouse self-assigned this Oct 6, 2021
@blakerouse
Contributor

I just tested this on Ubuntu 20.04.

I used Staging Cloud to create a 7.14.2 stack.

I enrolled the Elastic Agent 7.14.2 on the Ubuntu 20.04 into the Default Policy. I added the Endpoint Security integration and the Elastic Agent reported Healthy.

I then upgraded the stack to 7.15.1 in the staging cloud. Once complete I then selected Upgrade agent from the Kibana UI. The 7.14.2 Elastic Agent upgraded to 7.15.1 successfully and reported Healthy.

I let it run for a few hours and it remains Healthy.

@blakerouse
Contributor

@amolnater-qasource Looking at the logs from your last comment, I do see the following:

{"log.level":"info","@timestamp":"2021-10-05T07:37:21.177Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-10-05T03:37:21-04:00 - message: Application: endpoint-security--7.16.0-SNAPSHOT[1ab6c708-b4bc-4f4a-979d-b09caf09181f]: State changed to CONFIG:  - type: 'STATE' - sub_type: 'CONFIG'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-10-05T07:37:33.053Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-10-05T03:37:33-04:00 - message: Application: endpoint-security--7.16.0-SNAPSHOT[1ab6c708-b4bc-4f4a-979d-b09caf09181f]: State changed to DEGRADED: Protecting with policy {c54a8dde-1cce-4dd7-a2b8-83fd02d61eab} - type: 'STATE' - sub_type: 'RUNNING'","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2021-10-05T07:37:33.053Z","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'degraded'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-10-05T07:42:01.206Z","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'online'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-10-05T07:42:01.206Z","log.origin":{"file.name":"log/reporter.go","file.line":40},"message":"2021-10-05T03:42:01-04:00 - message: Application: endpoint-security--7.16.0-SNAPSHOT[1ab6c708-b4bc-4f4a-979d-b09caf09181f]: State changed to RUNNING: Protecting with policy {c54a8dde-1cce-4dd7-a2b8-83fd02d61eab} - type: 'STATE' - sub_type: 'RUNNING'","ecs.version":"1.6.0"}

It seems Endpoint Security stayed degraded for about 4.5 minutes (07:37:33 to 07:42:01 UTC).
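
The time spent degraded can be read directly off the two "Elastic Agent status changed to" lines in the excerpt. Below is a minimal sketch, assuming the agent log is newline-delimited JSON as shown above. `degraded_intervals` is an illustrative helper written for this thread, not an Elastic Agent tool, and the status strings it matches are taken only from the lines quoted here.

```python
import json
from datetime import datetime

# Message prefix as it appears in the quoted log lines.
STATUS_PREFIX = "Elastic Agent status changed to: "

# The two status lines copied from the log excerpt in this comment.
SAMPLE = """
{"log.level":"warn","@timestamp":"2021-10-05T07:37:33.053Z","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'degraded'","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2021-10-05T07:42:01.206Z","log.origin":{"file.name":"status/reporter.go","file.line":236},"message":"Elastic Agent status changed to: 'online'","ecs.version":"1.6.0"}
"""


def parse_ts(value):
    """Parse a timestamp such as 2021-10-05T07:37:33.053Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f%z")


def degraded_intervals(lines):
    """Return (start, end, seconds) for every degraded -> online span."""
    intervals = []
    degraded_since = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        message = record.get("message", "")
        if not message.startswith(STATUS_PREFIX):
            continue  # not an agent status change
        status = message[len(STATUS_PREFIX):].strip("'")
        ts = parse_ts(record["@timestamp"])
        if status == "degraded" and degraded_since is None:
            degraded_since = ts
        elif status == "online" and degraded_since is not None:
            seconds = (ts - degraded_since).total_seconds()
            intervals.append((degraded_since, ts, seconds))
            degraded_since = None
    return intervals


if __name__ == "__main__":
    for start, end, seconds in degraded_intervals(SAMPLE.splitlines()):
        print(f"{start.isoformat()} -> {end.isoformat()}: {seconds:.3f}s degraded")
```

For the two lines above this reports a single span of 268.153 seconds. Running the same function over a full log file would show whether the degraded window is consistent between the vSphere and AWS hosts.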

@amolnater-qasource
Author

amolnater-qasource commented Oct 15, 2021

Hi @blakerouse
We have revalidated this today with various ways on 7.16.0 Snapshot.
Note:
As per reported issue elastic/kibana#115015 , hosted fleet server was not available so we have installed our own Fleet Server to test agent upgrades.

VMs used:
Ubuntu 20
Centos 8.0

Build details:
Build: 45084
Commit: e9f058c53f486add5530b2540855a727a8e79f71
Artifact Link[7.15.1 agent]: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-7.15.1-linux-x86_64.tar.gz

We observed the following inconsistent behaviours when installing and upgrading Linux .tar agents:

  1. On the first installation, both 7.15.1 agents (Ubuntu 20 and Centos 8.0 .tar agents) had no logs.
  2. On installing again, logs appeared. [Above issue resolved]
  3. On upgrading, both agents got stuck in the Updating state:
     • The Linux .tar agent on Ubuntu 20 upgraded to 7.16.0, but kept showing the Updating state.
     • The Linux .tar agent on Centos 8 didn't upgrade to 7.16.0 and remained stuck in the Updating state.

Screenshot:
14

Ubuntu 20 logs:
logs.zip
endpoint-000000.zip

We will re-attempt this once a new snapshot build is available.

cc: @EricDavisX
Thanks

@blakerouse
Contributor

@amolnater-qasource Are you testing upgrading an Elastic Agent that is also running the Fleet Server? I think we need to separate upgrading an Elastic Agent without a Fleet Server from upgrading an Elastic Agent with a Fleet Server.

My testing was around using Fleet Server in the cloud, and with release versions of Stack and Elastic Agent, which do not have the current Kibana issue.

Can you try the steps I performed to see if your host shows successful there? It would be great to have a baseline of a known working path so we can determine the bad path.

@amolnater-qasource
Author

Hi @michalpristas

just modify elastic-agent.yml and add agent.download.timeout: 300

We have updated the elastic-agent.yml with agent.download.timeout: 300.
We attempted to upgrade 7.15.2 released agent on latest snapshot. However we are still not able to upgrade the agent.

Screenshot:
12

Build details:
Build: 45910
Commit: af229de1a1dfbe3a65089c75e178351c9d49e68d
Artifact Link: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-7.15.2-linux-x86_64.tar.gz

Thanks

@michalpristas
Contributor

That is still a client timeout. Either increase the timeout even more, or check connectivity and whether the connection is being dropped by a firewall.

@amolnater-qasource
Author

Hi @michalpristas
As per feedback we have increased agent.download.timeout from 300 to 6000.
We have attempted the same on two different Ubuntu 20 and one centos 8 VM.
We still got the same error for all the three agents on triggering upgrade from Kibana UI.
Further to check connectivity, we have tried to connect using ping. We are successfully able to communicate through ping with these VMs.

Screenshot:
1

Please let us know if anything else is required.

Thanks
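
Ping only exercises ICMP reachability, so a successful ping does not show whether an HTTPS download of the artifact completes from these VMs. Below is a minimal sketch of a download check, assuming Python 3 is available on the host. The names `read_with_stats` and `timed_download` are illustrative and not part of Elastic Agent, and the socket timeout passed to `urlopen` here is an assumption for this test only; it is not the same mechanism as `agent.download.timeout`.

```python
import sys
import time
import urllib.request


def read_with_stats(stream, chunk_size=64 * 1024):
    """Drain a file-like object and return (total_bytes, elapsed_seconds)."""
    total = 0
    start = time.monotonic()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        total += len(chunk)
    return total, time.monotonic() - start


def timed_download(url, timeout_seconds=300):
    """Fetch url and report how many bytes arrived and how long it took.

    timeout_seconds is a socket-level timeout for this test script only.
    """
    with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
        return read_with_stats(response)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: pass one of the artifact links from this thread as the argument.
    size, seconds = timed_download(sys.argv[1])
    print(f"downloaded {size} bytes in {seconds:.1f}s")
```

Running this on a failing vSphere VM and on the passing AWS VM with the same artifact link would show whether the download itself stalls or whether the problem is elsewhere in the upgrade.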

@EricDavisX EricDavisX changed the title [Agent-Upgrade]: 7.14.1 Ubuntu .tar agent went Unhealthy on upgrade with Endpoint Security [Agent-Upgrade]: For Linux & macOS .tar deploy; Agent goes Unhealthy on upgrade with Endpoint Security Nov 29, 2021
@EricDavisX EricDavisX changed the title [Agent-Upgrade]: For Linux & macOS .tar deploy; Agent goes Unhealthy on upgrade with Endpoint Security [Agent-Upgrade]: For Linux .tar deploy; Agent goes Unhealthy on upgrade with Endpoint Security Nov 29, 2021
@jlind23 jlind23 added the v8.1.0 label Dec 7, 2021
@amolnater-qasource
Author

Hi @EricDavisX
We have revalidated upgrading Linux .tar 7.16.0 Snapshot agent to latest version on latest 8.0 Snapshot.

  • We are still not able to upgrade Linux .tar agent.

Build details:
Build: 48594
Commit: ad3660f3acbfe6eb809d869b908221edf2846313

We are successfully able to upgrade Windows and MAC agents.

Thanks

@jlind23
Contributor

jlind23 commented Jan 6, 2022

Hey @amolnater-qasource, is this still up to date?

@EricDavisX
Contributor

@blakerouse can you confirm which file and value can be changed on the Agent host files to increase the timeout to re-test this? We had discussed a 10 minute value might be high enough to add more confidence in the file download.

@blakerouse
Contributor

We have a similar SDH that is reporting that changing that value does not fix the issue. So it might be that either the setting is not working or another timeout is occurring that we do not know about currently.

@amolnater-qasource
Author

Hi @jlind23
We have revalidated this issue on upgrading Linux agent from 7.17>8.0 on latest 8.0 Snapshot and we are still not able to upgrade linux .tar agent.

  • We have attempted on a fresh cloned Ubuntu 20 VM and found it still reproducible.

Build details:
BUILD 48933
COMMIT 2fa075fc23e8e5e78c862cd6518fdcd3430ae1f7
7.17 artifact: https://staging.elastic.co/7.17.0-2a228a35/downloads/beats/elastic-agent/elastic-agent-7.17.0-linux-x86_64.tar.gz

Screenshot:
9

Please let us know if anything else is required from our end.
Thanks

@jlind23
Contributor

jlind23 commented Jan 11, 2022

@blakerouse how can we move this forward then?

@blakerouse
Contributor

@jlind23 I am going to start working on why it's not working and on a proper fix.

@amolnater-qasource
Author

Hi @blakerouse
We have attempted to upgrade 7.17 linux.tar agent from different OS's on 8.0 Snapshot.
We had below observations:

OS          Platform   Upgrade Status
Ubuntu 20   VSphere    FAIL
Centos 8    VSphere    FAIL
Ubuntu 20   AWS        PASS

cc: @jlind23
Thanks!

@amolnater-qasource
Author

On further testing we are successfully able to upgrade 7.17 Linux .tar fleet server from Centos 8 (on 2nd attempt).
However unable to upgrade the same on Ubuntu 20- VSphere VM: Failed on all attempts.
Thanks!

@jlind23
Contributor

jlind23 commented Feb 2, 2022

Hey @blakerouse could you please take a look here? You were working on a fix, did you find anything?

@amolnater-qasource
Author

Hi @jlind23
We have revalidated upgrading linux agent on 8.0 Snapshot using various VSphere Ubuntu 20 VMs and we have found it fixed now.

  • We are successfully able to upgrade linux agent from 7.17>8.0.

Build details:
BUILD: 49221
COMMIT: 553220be20ca149522836c887ad6d59edac5ddd8
Artifact Link: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-7.17.0-linux-x86_64.tar.gz

Screenshot:
2
3

Hence we are closing this issue.
Thanks

@amolnater-qasource
Author

amolnater-qasource commented Mar 1, 2022

Hi @jlind23
We have attempted to upgrade linux agent on 7.17.1>8.1 Snapshot and found this issue again reproducible.

  • Linux .tar Agent goes Unhealthy on upgrade.

TESTED WITH UBUNTU VSPHERE VM
Build details:
BUILD: 50454
COMMIT: 4194b8f2e55cd6cfc29e8ffcbfd09adf8c0448b6
Artifact Link: https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-7.17.1-linux-x86_64.tar.gz

Logs:
elastic-agent-diagnostics-2022-03-01T07-26-45Z-00.zip

Hence we are re-opening this issue.
Thanks

@jlind23
Contributor

jlind23 commented Mar 1, 2022

@ph @ruflin tell me if I'm wrong, but the migration path is to first upgrade to 8.0 and only then upgrade to a later 8.X, right?

@ruflin
Member

ruflin commented Mar 1, 2022

I would expect 7.17 to 8.1 to also work.

@ph
Contributor

ph commented Mar 1, 2022

I would also expect the same for 7.17->8.1. I am also part of the school that waits for the first minor before upgrading.

@ulab

ulab commented Mar 2, 2022

I am experiencing similar issues while upgrading from 8.0.0 to 8.0.1.

Some agents upgraded fine, others won't. The Fleet server is one of those stubborn ones.

The base system is always Debian 11.2 on a VMware machine.

The files themselves download fine with curl and wget in less than two seconds. When the agent tries to download them, the files are created in the download directory, but stay at 0 bytes.

I have tried downloading the files manually into the download folder and retrying the update, and that seems to work around the issue. I am just not sure whether the agent uses the manually downloaded files if they exist, or whether the download randomly worked that time.
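
The 0-byte symptom described above is easy to check for before and after a retry. Below is a minimal sketch, assuming Python 3 on the host. `find_empty_files` is an illustrative helper, not an Elastic Agent tool, and the download directory is deliberately left as a command-line argument because its location is not stated in this thread.

```python
import sys
from pathlib import Path


def find_empty_files(directory):
    """Return sorted paths of regular files under directory that are 0 bytes."""
    root = Path(directory)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.stat().st_size == 0
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # The argument is the agent's download directory on this host (an
    # assumption supplied by whoever runs the script).
    empty = find_empty_files(sys.argv[1])
    if not empty:
        print("no 0-byte files found")
    for path in empty:
        print(f"0 bytes: {path}")
```

If the list is empty after a manual download but the upgrade still succeeds, that would suggest the agent reused the files already on disk; this is a hypothesis to test, not something the thread confirms.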

@jlind23
Contributor

jlind23 commented Mar 8, 2022

As discussed with @blakerouse, closing this one in favour of #104.
