Describe the enhancement:
Add a timeout for each incoming chunk of data while downloading artifacts, so that a stalled download fails and is retried instead of hanging until the overall download timeout is reached.
IdleConnTimeout might help with that, but I'm unsure about its behavior. This needs to be tested.
Describe a specific use case for the enhancement or feature:
Currently we have an overall timeout (default: 2h) for the entire download of an artifact.
// Elastic Agent binary is rather large and based on the network bandwidth it could take some time
// to download the full file. 120 minutes is a very large value, but we really want it to finish.
// The HTTP download will log progress in the case that it is taking a while to download.
transport.Timeout = 120 * time.Minute
Due to recently discovered issues with our CDN (#4268), the agent might get stuck in the middle of a download without receiving the next chunk of data from the CDN.
It would then hang on the same connection for 2 hours before failing to download the artifact. Handling chunk timeouts separately would make the agent more resilient and the upgrade more likely to succeed.
What is the definition of done?
The agent detects a stuck download, interrupts it and retries.
There are tests validating this behavior.
I did some testing, and the described behavior turned out to be very tricky to implement:
I added a test for the HTTP artifact downloader using an HTTP server whose downloads get stuck.
IdleConnTimeout made absolutely no difference in my testing; the clients get stuck regardless.
Implementing a timeout on each received chunk (which is already a very inefficient solution) made no difference either.
The main reason is buffering: unless the server flushes its buffer, the client does not get unblocked. Whether buffers are flushed depends on the particular server implementation. In Go, it has to be done manually unless the buffer is full. So implementing something on the client side would not guarantee that the problem is solved.
All the code related to my testing and solution approach can be found in #4605.
The timeout snippet quoted in the issue description comes from elastic-agent/internal/pkg/agent/application/upgrade/artifact/config.go, lines 191 to 194 at commit 1ee9cc7.