Networking robustness and resiliency on Azure and beyond (AWS, GCP, AliCloud) #10779
Elsewhere on Kafka

On confluentinc/librdkafka#3109, we already outlined similar observations with Apache/Confluent Kafka. There are some discussions about how these problems were solved from the client side within librdkafka, which are also worth a read.

Quotes
Magnus Edenhill (@edenhill) says at confluentinc/librdkafka#2845 (comment):
Joakim Blach Andersen (@JoakimBlach) says at confluentinc/librdkafka#2845 (comment):
Magnus Edenhill (@edenhill) says at confluentinc/confluent-kafka-dotnet#1305 (comment):
Mitigations

Introduction

Where applicable, Microsoft recommends using keepalives to reset the outbound idle timeout; see also https://docs.microsoft.com/en-us/azure/load-balancer/troubleshoot-outbound-connection#idletimeout.

Examples

Within confluentinc/librdkafka#3109 (comment), we outlined how these problems can be mitigated by applying the respective TCP keepalive settings on the host OS level. If that is not possible, the improvement crate/crate-python#374 by @chaudum unlocks the option to achieve the same on the application level for the Python client (available for Linux).

Linux Kernel TCP settings

In the same spirit, I would also like to outline some recommended settings for mitigating the "Azure LB closing idle network connections" issue on the host level. These settings can probably be applied to the nodes (VMs) as well as to the Kubernetes pods/containers.
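As a minimal sketch of what such host-level tuning could look like, the snippet below renders the `sysctl -w` invocations for the three Linux keepalive knobs in question. The values (120/30/8) simply mirror the application-level example in this thread and are an assumption, not an official recommendation; the point is to make keepalive probes fire well before the 4-minute idle timeout.

```python
# Hypothetical helper: render `sysctl -w` commands for the Linux TCP
# keepalive knobs discussed above. Values are assumptions mirroring the
# application-level example (120/30/8); tune them for your environment.
KEEPALIVE_SETTINGS = {
    "net.ipv4.tcp_keepalive_time": 120,   # seconds of idle time before the first probe
    "net.ipv4.tcp_keepalive_intvl": 30,   # seconds between unanswered probes
    "net.ipv4.tcp_keepalive_probes": 8,   # probes before declaring the peer dead
}

def render_sysctl_commands(settings):
    """Return the `sysctl -w` invocations for the given settings."""
    return [f"sysctl -w {key}={value}" for key, value in settings.items()]

for command in render_sysctl_commands(KEEPALIVE_SETTINGS):
    print(command)
```

To persist such settings across reboots, the same key/value pairs would typically go into a file under `/etc/sysctl.d/`.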
Application settings for Python

```python
from crate import client

connection = client.connect(
    "https://localhost:4200/",
    socket_keepalive=True,
    socket_tcp_keepidle=120,
    socket_tcp_keepintvl=30,
    socket_tcp_keepcnt=8,
)
```

-- https://crate.io/docs/python/en/latest/connect.html#socket-options

Resources
Elsewhere on the Cloud

Following up on cloudfoundry/guardian#70 and cloudfoundry/guardian#165, I would also like to quote some valuable comments from others.

@arjenw says at cloudfoundry/guardian#165 (comment):
@krumts says at cloudfoundry/guardian#165 (comment):
Thanks for gathering the information. I'm closing this, as there is no further actionable item here and the information can still be found via search.
Also applied on CrateDB Cloud.
Hi there,
first things first: Thanks for the tremendous amount of work you are putting into CrateDB. You know who you are.
Introduction
This is not meant to be a specific bug report, but a general heads-up and RCA, as we believe the issues we have been experiencing when connecting to CrateDB within Azure environments are worth sharing with the community.
We wanted to share our findings as a wrap-up and future reference for others investigating similar issues. In this spirit, apologies for not completing the checklist; the issue can well be closed right away.
The topics span cloud/managed networking (problems) in general, as well as things related to CrateDB and the respective interactions of its client drivers. Specifically, these networking problems can happen in any environment where connections pass through a NAT gateway.
So, here we go.
General research
Azure LB closing idle network connections
The problem here is that Azure network load-balancing components silently drop idle network connections after 4 minutes.
The LB does not even send RST packets to the communication partners, so client and server sockets will most probably try to reuse these dead connections.
In turn, services will hit individual socket timeouts, or the kernel will keep retransmitting with backoff for another 15+ minutes until it considers the connection dead.
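The "15+ minutes" figure can be sketched with a back-of-envelope calculation. Assuming the common Linux defaults of a 200 ms minimum RTO, a 120 s RTO cap, and tcp_retries2 = 15 (i.e. 15 retransmissions after the original send, with the RTO doubling each time), the kernel keeps trying for roughly a quarter of an hour:

```python
# Back-of-envelope sketch of the retransmission backoff described above.
# Assumptions: Linux defaults of TCP_RTO_MIN = 200 ms, TCP_RTO_MAX = 120 s,
# and tcp_retries2 = 15; actual timing depends on the measured RTT.
RTO_MIN = 0.2      # seconds
RTO_MAX = 120.0    # seconds
TCP_RETRIES2 = 15

def retransmission_timeout_seconds():
    """Total time before the kernel gives up on an unacknowledged segment."""
    total, rto = 0.0, RTO_MIN
    for _ in range(TCP_RETRIES2 + 1):  # initial wait plus 15 doubled retries
        total += rto
        rto = min(rto * 2, RTO_MAX)
    return total

print(retransmission_timeout_seconds() / 60)  # roughly 15.4 minutes
```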
Quotes
Resources
With kind regards,
Andreas.