Better connection management when downloading artifacts #4409

Closed
rdner opened this issue Mar 13, 2024 · 2 comments
Labels: enhancement, Team:Elastic-Agent

Comments

@rdner
Member

rdner commented Mar 13, 2024

Describe the enhancement:

Add an additional timeout for each incoming chunk of data while downloading artifacts, so we can fail the download and retry instead of getting stuck until the overall download timeout is reached.

Perhaps IdleConnTimeout could help with that, but I'm unsure about its behavior. This needs to be tested.

Describe a specific use case for the enhancement or feature:

Currently we have an overall timeout (default: 2h) for the entire download of an artifact.

// Elastic Agent binary is rather large and based on the network bandwidth it could take some time
// to download the full file. 120 minutes is a very large value, but we really want it to finish.
// The HTTP download will log progress in the case that it is taking a while to download.
transport.Timeout = 120 * time.Minute

Due to recently discovered issues with our CDN (#4268), the agent might get stuck in the middle of a download, never receiving the next chunk of data from the CDN.

In that case the agent hangs on the same connection for 2 hours and then fails to download the artifact. Handling chunk timeouts separately would make the agent more resilient and the upgrade more likely to succeed.

What is the definition of done?

  • The agent detects a stuck download, interrupts it, and retries.
  • There are tests validating this behavior.
rdner added the enhancement and Team:Elastic-Agent labels on Mar 13, 2024
@elasticmachine
Contributor

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

@rdner
Member Author

rdner commented Apr 22, 2024

I did some testing and it turned out to be very tricky to implement the described behavior:

  • I added a test for the HTTP artifact downloader using an HTTP server whose downloads get stuck.
  • The IdleConnTimeout made absolutely no difference in my testing; the clients get stuck regardless.
  • Implementing a timeout on each received chunk (which is already a very inefficient solution) made no difference either.
  • The main reason is buffering: unless the server flushes its buffer, the client does not get unblocked. Whether and when buffers are flushed depends on the particular server implementation. In Go it has to be done manually unless the buffer is full. So implementing something on the client side would not 100% guarantee that we solved it.

All the code related to my testing and solution approach can be found in #4605.
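
To make the buffering point concrete, here is a small, self-contained sketch of a stuck test server. It is not the code from #4605, and `newStuckServer`, `fetch`, and `demo` are hypothetical names. The handler writes one small chunk and then stalls. With an explicit `Flush` the client receives the chunk before hanging. Without it, the chunk stays in the server's buffer and the client does not even get response headers.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// newStuckServer returns a test server that writes one chunk, optionally
// flushes it, and then stalls until the client goes away.
func newStuckServer(chunk []byte, flush bool) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write(chunk)
		if flush {
			if f, ok := w.(http.Flusher); ok {
				f.Flush() // push headers and the chunk onto the wire
			}
		}
		select {
		case <-r.Context().Done(): // client disconnected or cancelled
		case <-time.After(5 * time.Second): // safety net for the sketch
		}
	}))
}

// fetch downloads url with an overall deadline and returns whatever
// was received before the deadline expired, plus the error.
func fetch(url string, timeout time.Duration) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err // headers never arrived
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body) // partial data plus the read error
}

// demo runs one download against a stuck server.
func demo(flush bool) (string, error) {
	srv := newStuckServer([]byte("partial"), flush)
	defer srv.Close()
	data, err := fetch(srv.URL, 300*time.Millisecond)
	return string(data), err
}

func main() {
	for _, flush := range []bool{true, false} {
		got, err := demo(flush)
		fmt.Printf("flush=%v received=%q err=%v\n", flush, got, err)
	}
}
```

In both cases the client only gets unblocked by its own deadline, and what it has received by then depends entirely on when the server chose to flush.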

rdner closed this as completed on Apr 22, 2024