Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

envoy: Patch original destination cluster to handle parallel host instantiation #778

Merged
merged 1 commit into from
Jun 3, 2024

Conversation

jrajahalme
Copy link
Member

@jrajahalme jrajahalme commented May 22, 2024

Original destination cluster creates Host instances for the destination addresses on-demand. When multiple worker threads need to forward to the same destination at the same time (as is possible with curl --parallel, for example), separate instances of the destination Host objects are created. Meanwhile, Envoy uses HostSharedPtr as a map key in connection pool containers, which leads to multiple different connections being created for the same destination, even from the same worker, as the originally used Host instance is replaced with another, depending on which worker thread gets to add their Host object to the shared cluster state first. Multiple connections to the same destination fail when the original source address and port is being used, as is typically needed for Cilium policy enforcement, due to socket bind failing with cannot bind requested address error.

This PR fixes this by modifying the original destination cluster implementation to only create one Host object for any destination for any one cluster instance. This requires closer coordination between the worker threads in a form of an additional mutex, on which a write lock is taken whenever a new Host needs to be created. This synchronizes creation of new Hosts between the threads so that a new Host instance is created only if one has not yet been created by any of the worker threads.

Another possible way of fixing this issue would be to replace HostSharedPtr with the destination address as a map key, wherever it is used. This strategy could fix related issues with other cluster and/or load balancer types, it they were to exist. Upstream discussion is needed to settle on the proper fix, and we can adapt as needed when that is done.

It is yet to be decided if and when this should be backported.

@jrajahalme jrajahalme added the wip work-in-progress, no need to review label May 22, 2024
@jrajahalme jrajahalme requested a review from a team as a code owner May 22, 2024 13:19
@jrajahalme jrajahalme requested review from sayboras and removed request for a team May 22, 2024 13:19
@jrajahalme jrajahalme marked this pull request as draft May 22, 2024 13:19
@jrajahalme jrajahalme force-pushed the envoy-conn-pool-debug branch 11 times, most recently from 57d4adb to 3ed83d0 Compare May 24, 2024 18:03
@jrajahalme jrajahalme marked this pull request as ready for review May 24, 2024 18:04
@jrajahalme jrajahalme added bug Something isn't working and removed wip work-in-progress, no need to review labels May 28, 2024
@jrajahalme jrajahalme changed the title debug envoy conn pool envoy: Patch original destination cluster to handle parallel host instantiation May 28, 2024
Copy link
Member

@sayboras sayboras left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to enable related upstream tests to make sure that we didn't break anything ?

Patch original destination cluster to avoid multiple hosts for the same
address.

Connection pool containers use HostSharedPtr as map keys, rather than the
address of the host. This leads to multiple connections when there are
multiple Host instances for the same address. This is breaking use of the
original source address and port for upstream connections since only one
such connection can exist at any one time.

Original destination cluster implementation creates such duplicate Host
instances when two worker threads are racing to create a Host for the
same destination at the same time.

Fix this by keeping a separate 'updates_map' where each worker places a
newly created Host for the original destination. This map is used to look
for the Host is it can not be found from the shared read-only
'host_map'. Access to 'updates_map' is syncronized so that it can be
safely shared by the worker threads. The main threads consolidates the
updates from the 'updates_map' to a new instance of the shared, read-only
hosts map, so that the workers do not need to stall for possibly large
map updates.

Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
@jrajahalme jrajahalme added this pull request to the merge queue Jun 3, 2024
Merged via the queue into main with commit d34c0d3 Jun 3, 2024
5 checks passed
@jrajahalme jrajahalme deleted the envoy-conn-pool-debug branch June 3, 2024 19:10
@sayboras sayboras mentioned this pull request Jun 3, 2024
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-done/1.28 bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants