original_src: Fix ephemeral port exhaustion by setting IP_BIND_ADDRESS_NO_PORT#38288
Hi @jronak, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already.
cc @klarose @mattklein123 as codeowners.
adisuissa
left a comment
drive-by comment:
IIUC this PR modifies the current behavior. If so, the feature should either be configured by a config-knob (i.e., adding this to the filter's API), or if the feature is more of a bugfix then it should be runtime guarded.
mattklein123
left a comment
Couple of questions:
- What kernel versions is this option available on? Does it work on all kernels we would expect Envoy to run on with other options?
- Presumably it's still possible to fail later when a port is actually bound? Do we correctly handle that behavior?
- Agree with @adisuissa that this should probably be at least runtime guarded if not feature driven.
/wait
The original_src filter binds the upstream socket to the original source IP address by invoking the bind syscall. This works correctly in principle, but in production we observed ephemeral port exhaustion when the original_src filter was enabled. When a socket is bound to a non-zero IP with a zero port, the kernel assigns an ephemeral port immediately. That port is reserved for the socket and cannot be shared with other connections, because at bind time the kernel does not know whether the socket will eventually connect or listen. To address this, the kernel provides the IP_BIND_ADDRESS_NO_PORT socket option, which defers ephemeral port allocation until connect for sockets that will only make outgoing connections. Using this option helps prevent ephemeral port exhaustion.
Signed-off-by: Ronak Jain <ronakjainc@gmail.com>
100b0d5 to 31301b1
This socket option has been available since Linux kernel 4.2, so it is present on both SLTS (4.4) and LTS (5.4+) kernel versions.
With the option set, an ephemeral port is no longer allocated when bind is called on these sockets. Instead, as with sockets that never call bind, the kernel allocates the ephemeral port during the connect syscall. This means no special handling is required on our end, as the kernel consistently manages port allocation. If no port is available at that point, connect fails, and our existing error handling is sufficient to address that error.
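The bind-then-connect behavior described here can be observed with plain POSIX sockets. The following is a minimal sketch for Linux 4.2 or later, not the code from this PR: the helper names `localPort`, `bindLoopback`, and `connectViaLoopbackListener` are hypothetical, and the loopback address stands in for the original source IP.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Fallback for older libc headers; 24 is the Linux value of this option.
#ifndef IP_BIND_ADDRESS_NO_PORT
#define IP_BIND_ADDRESS_NO_PORT 24
#endif

// Hypothetical helper: local port currently assigned to fd, or -1 on error.
int localPort(int fd) {
  sockaddr_in addr{};
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
    return -1;
  }
  return ntohs(addr.sin_port);
}

// Hypothetical helper: TCP socket bound to 127.0.0.1 with port 0.
// When defer_port is true, IP_BIND_ADDRESS_NO_PORT is set before bind.
// Returns the fd, or -1 on error.
int bindLoopback(bool defer_port) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    return -1;
  }
  if (defer_port) {
    int on = 1;
    if (setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &on, sizeof(on)) != 0) {
      close(fd);
      return -1;
    }
  }
  sockaddr_in src{};
  src.sin_family = AF_INET;
  src.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  src.sin_port = 0;  // zero port: let the kernel choose
  if (bind(fd, reinterpret_cast<sockaddr*>(&src), sizeof(src)) != 0) {
    close(fd);
    return -1;
  }
  return fd;
}

// Hypothetical helper: connects fd to a throwaway loopback listener and
// returns the local port fd holds after connect, or -1 on error.
int connectViaLoopbackListener(int fd) {
  int listener = socket(AF_INET, SOCK_STREAM, 0);
  if (listener < 0) {
    return -1;
  }
  sockaddr_in dst{};
  dst.sin_family = AF_INET;
  dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  dst.sin_port = 0;
  socklen_t len = sizeof(dst);
  int port = -1;
  if (bind(listener, reinterpret_cast<sockaddr*>(&dst), sizeof(dst)) == 0 &&
      listen(listener, 1) == 0 &&
      getsockname(listener, reinterpret_cast<sockaddr*>(&dst), &len) == 0 &&
      connect(fd, reinterpret_cast<sockaddr*>(&dst), sizeof(dst)) == 0) {
    port = localPort(fd);
  }
  close(listener);
  return port;
}
```

Calling `localPort` on a socket from `bindLoopback(false)` should report a non-zero port right after bind, while a socket from `bindLoopback(true)` should report port 0 until `connectViaLoopbackListener` succeeds. If the ephemeral range were exhausted, the failure would be expected to surface from `connect` (typically as `EADDRNOTAVAIL`) rather than from `bind`.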
I agree this needs to be behind a runtime guard. I have updated the PR to use a reloadable runtime feature. Thanks for the review @adisuissa @mattklein123
Commit Message: The original_src filter binds the upstream socket to the source IP address by invoking the bind syscall. This works correctly, but we observed ephemeral port exhaustion in production when the original_src filter was enabled.
When a socket binds to a non-zero IP with a zero port, the kernel assigns an ephemeral port immediately. This port remains unavailable for reuse because the kernel does not know if the socket will eventually connect or listen.
To address this, the kernel provides the IP_BIND_ADDRESS_NO_PORT socket option, which disables immediate ephemeral port reservation for sockets intended for connection. Using this option helps prevent ephemeral port exhaustion.
Additional Description: N/A
Risk Level: Low
Testing: Unit test
Docs Changes: N/A
Release Notes: N/A