Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The elastic-agent upgrade command can fail even though the upgrade succeeds #3890

Closed
cmacknz opened this issue Dec 8, 2023 · 10 comments · Fixed by #4519 or #4533
Closed

The elastic-agent upgrade command can fail even though the upgrade succeeds #3890

cmacknz opened this issue Dec 8, 2023 · 10 comments · Fixed by #4519 or #4533
Assignees
Labels
bug Something isn't working flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team

Comments

@cmacknz
Copy link
Member

cmacknz commented Dec 8, 2023

First observed in https://buildkite.com/elastic/elastic-agent/builds/5460#018c4990-62ec-4dc1-a9c0-81953d923d80. One of the package version tests failed because the elastic-agent ugprade command exited with an EOF error.

=== RUN   TestUpgradeBrokenPackageVersion
    upgrade_broken_package_test.go:59: Testing Elastic Agent upgrade from 8.12.0-SNAPSHOT to 8.11.2-SNAPSHOT...
    fetcher.go:91: Using existing artifact elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz
    fixture.go:261: Extracting artifact elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestUpgradeBrokenPackageVersion2415228739/001
    fixture.go:274: Completed extraction of artifact elastic-agent-8.12.0-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestUpgradeBrokenPackageVersion2415228739/001
    fixture.go:717: Components were not modified from the fetched artifact
    fetcher.go:91: Using existing artifact elastic-agent-8.11.2-SNAPSHOT-linux-arm64.tar.gz
    fixture.go:261: Extracting artifact elastic-agent-8.11.2-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestUpgradeBrokenPackageVersion2415228739/002
    fixture.go:274: Completed extraction of artifact elastic-agent-8.11.2-SNAPSHOT-linux-arm64.tar.gz to /tmp/TestUpgradeBrokenPackageVersion2415228739/002
    fixture.go:717: Components were not modified from the fetched artifact
    fixture.go:520: >> running binary with: [/tmp/TestUpgradeBrokenPackageVersion2415228739/001/elastic-agent-8.12.0-SNAPSHOT-linux-arm64/elastic-agent version --binary-only --yaml]
    fixture.go:520: >> running binary with: [/tmp/TestUpgradeBrokenPackageVersion2415228739/002/elastic-agent-8.11.2-SNAPSHOT-linux-arm64/elastic-agent version --binary-only --yaml]
    upgrader.go:167: Installing version "8.12.0-SNAPSHOT"
    fixture_install.go:99: [test TestUpgradeBrokenPackageVersion] Inside fixture install function
    fixture.go:520: >> running binary with: [/tmp/TestUpgradeBrokenPackageVersion2415228739/001/elastic-agent-8.12.0-SNAPSHOT-linux-arm64/elastic-agent install --force --non-interactive]
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent status --output json]
    upgrade_broken_package_test.go:83: package version files found: [data/elastic-agent-e3c069/package.version]
    upgrade_broken_package_test.go:88: removing package version file "/opt/Elastic/Agent/data/elastic-agent-e3c069/package.version"
    upgrader.go:202: Upgrading from version "8.12.0-SNAPSHOT" to version "8.11.2-SNAPSHOT"
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent upgrade 8.11.2-SNAPSHOT --source-uri file:///home/ubuntu/agent/.agent-testing/artifact --skip-verify]
    upgrade_broken_package_test.go:62: 
        	Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_broken_package_test.go:62
        	Error:      	Received unexpected error:
        	            	failed to start agent upgrade to version "8.11.2": exit status 1
        	            	Error initializing version information: reading package version from file "/opt/Elastic/Agent/data/elastic-agent-e3c069/package.version": open /opt/Elastic/Agent/data/elastic-agent-e3c069/package.version: no such file or directory
        	            	Error: Failed trigger upgrade of daemon: rpc error: code = Unavailable desc = error reading from server: EOF
        	            	For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.13/fleet-troubleshooting.html
        	Test:       	TestUpgradeBrokenPackageVersion
    fixture_install.go:127: [test TestUpgradeBrokenPackageVersion] Inside fixture cleanup function
    fixture_install.go:142: collecting diagnostics; test failed
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestUpgradeBrokenPackageVersion-diagnostics-2023-12-08T13:47:58Z.zip]
    fixture_install.go:268: failed to collect diagnostics to /home/ubuntu/agent/build/diagnostics/TestUpgradeBrokenPackageVersion-diagnostics-2023-12-08T13:47:58Z.zip (exit status 1): [WARNING]: failed to fetch agent diagnostics: error in DiagnosticAgent RPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused"[WARNING]: failed to fetch unit diagnostics: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused"[WARNING]: failed to fetch component diagnostics: error in DiagnosticComponents RPC call: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /run/elastic-agent.sock: connect: connection refused"Error: failed collecting diagnostics: no diags could be fetched
        For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/8.11/fleet-troubleshooting.html
    fixture_install.go:274: retrying in 15 seconds due to connection error; possible Elastic Agent was not fully started
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent diagnostics -f /home/ubuntu/agent/build/diagnostics/TestUpgradeBrokenPackageVersion-diagnostics-2023-12-08T13:47:58Z.zip]
    fixture_install.go:277: failed to collect diagnostics a second time at /home/ubuntu/agent/build/diagnostics/TestUpgradeBrokenPackageVersion-diagnostics-2023-12-08T13:47:58Z.zip (%!s()): Created diagnostics archive "/home/ubuntu/agent/build/diagnostics/TestUpgradeBrokenPackageVersion-diagnostics-2023-12-08T13:47:58Z.zip"
        ***** WARNING *****
        Created archive may contain plain text credentials.
        Ensure that files in archive are redacted before sharing.
        *******************
    fixture.go:520: >> running binary with: [/opt/Elastic/Agent/elastic-agent uninstall --force]
--- FAIL: TestUpgradeBrokenPackageVersion (100.75s)

However the upgrade had actually started looking at the agent logs:

{"log.level":"info","@timestamp":"2023-12-08T13:47:25.250Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":500},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.11.2-SNAPSHOT","state":"UPG_REQUESTED","metadata":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:25.250Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":136},"message":"Upgrading agent","log":{"source":"elastic-agent"},"version":"8.11.2-SNAPSHOT","source_uri":"file:///home/ubuntu/agent/.agent-testing/artifact","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:25.250Z","log.origin":{"file.name":"upgrade/cleanup.go","file.line":22},"message":"Cleaning up non-matching downloaded versions","log":{"source":"elastic-agent"},"version":"8.12.0","downloads.path":"/opt/Elastic/Agent/data/elastic-agent-e3c069/downloads","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:25.250Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":500},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.11.2-SNAPSHOT","state":"UPG_DOWNLOADING","metadata":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:25.250Z","log.origin":{"file.name":"upgrade/step_download.go","file.line":85},"message":"Using local upgrade artifact","log":{"source":"elastic-agent"},"version":"8.11.2-SNAPSHOT","drop_path":"/home/ubuntu/agent/.agent-testing/artifact","target_path":"/opt/Elastic/Agent/data/elastic-agent-e3c069/downloads","install_path":"/opt/Elastic/Agent/data/elastic-agent-e3c069/install","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:25.697Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":500},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.11.2-SNAPSHOT","state":"UPG_EXTRACTING","metadata":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.888Z","log.origin":{"file.name":"upgrade/step_unpack.go","file.line":42},"message":"Unpacked upgrade artifact","log":{"source":"elastic-agent"},"version":"8.11.2-SNAPSHOT","file.path":"/opt/Elastic/Agent/data/elastic-agent-e3c069/downloads/elastic-agent-8.11.2-SNAPSHOT-linux-arm64.tar.gz","hash":"1c21b0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.888Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":280},"message":"Copying action store","log":{"source":"elastic-agent"},"new_home_path":"/opt/Elastic/Agent/data/elastic-agent-1c21b0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.888Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":284},"message":"Copying action store path","log":{"source":"elastic-agent"},"from":"/opt/Elastic/Agent/data/elastic-agent-e3c069/action_store.yml","to":"/opt/Elastic/Agent/data/elastic-agent-1c21b0/action_store.yml","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.888Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":284},"message":"Copying action store path","log":{"source":"elastic-agent"},"from":"/opt/Elastic/Agent/data/elastic-agent-e3c069/state.yml","to":"/opt/Elastic/Agent/data/elastic-agent-1c21b0/state.yml","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.888Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":284},"message":"Copying action store path","log":{"source":"elastic-agent"},"from":"/opt/Elastic/Agent/data/elastic-agent-e3c069/state.enc","to":"/opt/Elastic/Agent/data/elastic-agent-1c21b0/state.enc","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.888Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":306},"message":"Copying run directory","log":{"source":"elastic-agent"},"new_run_path":"/opt/Elastic/Agent/data/elastic-agent-1c21b0/run","old_run_path":"/opt/Elastic/Agent/data/elastic-agent-e3c069/run","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.891Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":402},"message":"ignoring error: failed to copy \"/opt/Elastic/Agent/data/elastic-agent-e3c069/run\" to \"/opt/Elastic/Agent/data/elastic-agent-1c21b0/run\": lstat /opt/Elastic/Agent/data/elastic-agent-e3c069/run: no such file or directory","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.891Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":500},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.11.2-SNAPSHOT","state":"UPG_REPLACING","metadata":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.891Z","log.origin":{"file.name":"upgrade/step_relink.go","file.line":42},"message":"Changing symlink","log":{"source":"elastic-agent"},"symlink_path":"/opt/Elastic/Agent/elastic-agent","new_path":"/opt/Elastic/Agent/data/elastic-agent-1c21b0/elastic-agent","prev_path":"/opt/Elastic/Agent/elastic-agent.prev","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.893Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":500},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.11.2-SNAPSHOT","state":"UPG_WATCHING","metadata":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.893Z","log.origin":{"file.name":"upgrade/step_mark.go","file.line":131},"message":"Writing upgrade marker file","log":{"source":"elastic-agent"},"file.path":"/opt/Elastic/Agent/data/.update-marker","hash":"1c21b0","prev_hash":"e3c069","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.893Z","log.origin":{"file.name":"upgrade/step_mark.go","file.line":146},"message":"Updating active commit","log":{"source":"elastic-agent"},"file.path":"/opt/Elastic/Agent/.elastic-agent.active.commit","hash":"1c21b0","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.894Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":126},"message":"Starting upgrade watcher","log":{"source":"elastic-agent"},"path":"/opt/Elastic/Agent/elastic-agent","args":["/opt/Elastic/Agent/elastic-agent","watch","--path.config","/opt/Elastic/Agent","--path.home","/opt/Elastic/Agent"],"env":[],"dir":"","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.894Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":133},"message":"Upgrade Watcher invoked","log":{"source":"elastic-agent"},"agent.upgrade.watcher.process.pid":24106,"agent.process.pid":24072,"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.894Z","log.origin":{"file.name":"upgrade/rollback.go","file.line":121},"message":"releasing watcher 24106","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.894Z","log.origin":{"file.name":"upgrade/upgrade.go","file.line":217},"message":"Removing downloads directory","log":{"source":"elastic-agent"},"file.path":"/opt/Elastic/Agent/data/elastic-agent-e3c069/downloads","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.913Z","log.origin":{"file.name":"coordinator/coordinator.go","file.line":500},"message":"updated upgrade details","log":{"source":"elastic-agent"},"upgrade_details":{"target_version":"8.11.2-SNAPSHOT","state":"UPG_RESTARTING","metadata":{}},"ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-12-08T13:47:58.914Z","log.origin":{"file.name":"cmd/run.go","file.line":303},"message":"reexec shutdown channel triggered","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}

If we look at the log timestamps in Buildkite for when the failure occurred:

2023-12-08 08:58:19 EST | >>> (linux-arm64-ubuntu-2204-upgrade) Test output (sudo) (stdout): upgrade_broken_package_test.go:62:
-- | --
  | 2023-12-08 08:58:19 EST | >>> (linux-arm64-ubuntu-2204-upgrade) Test output (sudo) (stdout): Error Trace:	/home/ubuntu/agent/testing/integration/upgrade_broken_package_test.go:62
  | 2023-12-08 08:58:19 EST | >>> (linux-arm64-ubuntu-2204-upgrade) Test output (sudo) (stdout): Error:      	Received unexpected error:
  | 2023-12-08 08:58:19 EST | >>> (linux-arm64-ubuntu-2204-upgrade) Test output (sudo) (stdout): failed to start agent upgrade to version "8.11.2": exit status 1
  | 2023-12-08 08:58:19 EST | >>> (linux-arm64-ubuntu-2204-upgrade) Test output (sudo) (stdout): Error initializing version information: reading package version from file "/opt/Elastic/Agent/data/elastic-agent-e3c069/package.version": open /opt/Elastic/Agent/data/elastic-agent-e3c069/package.version: no such file or directory
  | 2023-12-08 08:58:19 EST | >>> (linux-arm64-ubuntu-2204-upgrade) Test output (sudo) (stdout): Error: Failed trigger upgrade of daemon: rpc error: code = Unavailable desc = error reading from server: EOF

We see the agent reexec-ed at 2023-12-08T13:47:58.914 and the elastic-agent upgrade error occurred later at 2023-12-08 08:58:19 (the timezones don't match the the minutes and second are accurate).

It seems as if the agent re-execed before responding to the elastic-agent upgrade command. We can either fix that or make elastic-agent upgrade asynchronous and expect users and our test to watch upgrade progress through the elastic-agent status command.

@cmacknz cmacknz added bug Something isn't working flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team labels Dec 8, 2023
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@cmacknz cmacknz changed the title The elastic-agent upgrade command can fail if though the upgrade succeeds The elastic-agent upgrade command can fail even though the upgrade succeeds Dec 8, 2023
@rdner rdner self-assigned this Feb 6, 2024
@pchila
Copy link
Contributor

pchila commented Feb 12, 2024

New occurrence in https://buildkite.com/elastic/elastic-agent/builds/6981
The EOF seems related to the GRPC server Stop() call instead of GracefulStop() in run.go

	// start the control listener
	if err := control.Start(); err != nil {
		return err
	}
	defer control.Stop()

According to grpc documentation:

Stop stops the gRPC server. It immediately closes all open connections and listeners. It cancels all active RPCs on the server side and the corresponding pending RPCs on the client side will get notified by connection errors.

I will try to change to a GracefulStop() and see how the agent does

@AndersonQ
Copy link
Member

@cmacknz
Copy link
Member Author

cmacknz commented Mar 11, 2024

@rdner
Copy link
Member

rdner commented Mar 21, 2024

@pchila pchila removed their assignment Mar 22, 2024
@rdner
Copy link
Member

rdner commented Mar 22, 2024

@rdner rdner self-assigned this Mar 22, 2024
@cmacknz
Copy link
Member Author

cmacknz commented Mar 22, 2024

The simplest option here is probably stop treating EOF from the Upgrade RPC call happening here as an error.

We can print out a message to check elastic-agent status to track the progress of the upgrade when this happens instead. Our tests will fail if the upgrade actually doesn't happen, we check the final running commit hash and compare it to the expected one.

@rdner
Copy link
Member

rdner commented Mar 22, 2024

@cmacknz I'll see what I can do next week.

@rdner
Copy link
Member

rdner commented Apr 7, 2024

Looks like it's still failing if we upgrade from a version that does not have the fix (https://buildkite.com/elastic/elastic-agent/builds/8162#018eb48e-5cdf-4daf-bf77-561e91e642ee). Despite the check in upgrader I introduced

isConnectionInterrupted := strings.Contains(err.Error(), "Unavailable") && strings.Contains(err.Error(), "EOF")
if !isConnectionInterrupted {
return fmt.Errorf("failed to start agent upgrade to version %q: %w\n%s", endVersionInfo.Binary.Version, err, upgradeOutput)
}

Looks like I need to check for a substring in the command output, not the error message. Will open another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working flaky-test Unstable or unreliable test cases. Team:Elastic-Agent Label for the Agent team
Projects
None yet
5 participants