Better connection management when downloading artifacts #4409

rdner · 2024-03-13T12:37:10Z

Describe the enhancement:

Add an additional timeout for each incoming chunk of data while downloading artifacts, so we can fail the download and retry instead of getting stuck until the overall download timeout is reached.

Perhaps IdleConnTimeout could help with that but I'm unsure about its behavior. Needs to be tested.

Describe a specific use case for the enhancement or feature:

Currently we have an overall timeout (default: 2h) for the entire download of an artifact.

elastic-agent/internal/pkg/agent/application/upgrade/artifact/config.go

Lines 191 to 194 in 1ee9cc7

    // Elastic Agent binary is rather large and based on the network bandwidth it could take some time
    // to download the full file. 120 minutes is a very large value, but we really want it to finish.
    // The HTTP download will log progress in the case that it is taking a while to download.
    transport.Timeout = 120 * time.Minute

Due to recently discovered issues with our CDN #4268, the agent might get stuck in the middle of a download without receiving the next chunk of data from the CDN.

That would leave the agent hanging on the same connection for 2 hours before the download finally fails. Handling chunk timeouts separately would make the agent more resilient and the upgrade more likely to succeed.

What is the definition of done?

- The agent detects a stuck download, interrupts it and retries.
- There are tests validating this behavior.


elasticmachine · 2024-03-13T12:37:12Z

Pinging @elastic/elastic-agent (Team:Elastic-Agent)

rdner · 2024-04-22T14:33:28Z

I did some testing and it turned out to be very tricky to implement the described behavior:

- I added a test for the HTTP artifact downloader using an HTTP server whose downloads get stuck.
- IdleConnTimeout made absolutely no difference in my testing; the clients got stuck regardless.
- Implementing a timeout on each received chunk (which is already a very inefficient solution) made no difference either.
- The main reason is buffering: unless the server flushes its buffer, the client does not get unblocked. Flushing depends on the particular server implementation. In Go it has to be done manually unless the buffer is full. So implementing something on the client side would not guarantee 100% that we solved it.

All the code related to my testing and solution approach can be found in #4605.

rdner added the enhancement and Team:Elastic-Agent labels Mar 13, 2024

rdner self-assigned this Mar 13, 2024

rdner mentioned this issue Mar 14, 2024

[Flaky Test]: TestFleetManagedUpgradeUnprivileged, TestFleetManagedUpgradePrivileged – context deadline exceeded #4339

Open

rdner closed this as completed Apr 22, 2024
