Description
Current Behavior
Issue Description
We are observing a significant discrepancy within a single Apache APISIX pod where the aggregated Prometheus metric for active connections (apisix_nginx_http_current_connections{state="active"}) reports a much higher number than the actual number of open TCP sockets reported by the operating system's lsof command for all Nginx processes within that pod.
This over-reporting seems related to Nginx worker process recycling, suggesting a flaw in how the nginx-lua-prometheus module or APISIX aggregation handles statistics from older, draining worker processes.
The investigation is based on the following specific data observed within a single APISIX Kubernetes pod:
1. System Metrics via curl (Prometheus Endpoint)
The endpoint reported these values for current connections:
apisix_nginx_http_current_connections{state="active"} 58193
apisix_nginx_http_current_connections{state="accepted"} 234773641
apisix_nginx_http_current_connections{state="handled"} 234773641
apisix_nginx_http_current_connections{state="waiting"} 58076
apisix_nginx_http_current_connections{state="writing"} 89
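For anyone reproducing this comparison, the scrape output can be reduced to one number per state. The following is a minimal sketch, not part of APISIX: the helper name conn_state is hypothetical, the sample text is the five lines quoted above, and it assumes each sample line carries no trailing timestamp.

```shell
#!/bin/sh
# conn_state STATE: read Prometheus exposition text on stdin and print the
# value of apisix_nginx_http_current_connections for the given state label.
# (conn_state is a hypothetical helper name; assumes no trailing timestamp.)
conn_state() {
    awk -v want="$1" '
        index($0, "apisix_nginx_http_current_connections{") == 1 &&
        index($0, "state=\"" want "\"") > 0 { print $NF }
    '
}

# Sample input: the values reported in this issue.
metrics='apisix_nginx_http_current_connections{state="active"} 58193
apisix_nginx_http_current_connections{state="accepted"} 234773641
apisix_nginx_http_current_connections{state="handled"} 234773641
apisix_nginx_http_current_connections{state="waiting"} 58076
apisix_nginx_http_current_connections{state="writing"} 89'

active=$(printf '%s\n' "$metrics" | conn_state active)
waiting=$(printf '%s\n' "$metrics" | conn_state waiting)
writing=$(printf '%s\n' "$metrics" | conn_state writing)

echo "active=$active waiting+writing=$((waiting + writing))"
# prints: active=58193 waiting+writing=58165
```

In nginx's stub_status accounting the active figure normally covers the reading, writing, and waiting states, so the small remainder here would correspond to the reading state, which is not among the lines quoted above.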
2. Operating System File Descriptor Count (lsof)
Running lsof across all Nginx PIDs within that same pod returned a significantly lower number:
lsof -iTCP -a -p $(pgrep nginx | tr '\n' ',') | wc -l
39385
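Two caveats apply to this count: lsof prints a header row by default, so wc -l is one higher than the number of sockets, and -iTCP also lists listening sockets, which appear once per worker. A possible refinement is to break the rows down per PID and TCP state. This is a sketch only; count_tcp_rows is a hypothetical helper name and it assumes lsof's default column layout.

```shell
#!/bin/sh
# count_tcp_rows: read `lsof -iTCP` output on stdin and print, for each PID,
# the number of rows in each TCP state (e.g. ESTABLISHED, LISTEN).
# (Hypothetical helper; assumes the default lsof column layout.)
count_tcp_rows() {
    awk '
        NR == 1 && $1 == "COMMAND" { next }   # skip the lsof header row
        {
            state = $NF                       # last column, e.g. "(ESTABLISHED)"
            gsub(/[()]/, "", state)           # strip the parentheses
            count[$2 " " state]++             # $2 is the PID column
        }
        END { for (k in count) print k, count[k] }
    ' | LC_ALL=C sort
}

# Example invocation inside the pod (not run here):
#   lsof -nP -iTCP -a -p $(pgrep nginx | tr '\n' ',') | count_tcp_rows
```

Each output line has the form "PID STATE COUNT", which makes it easier to see how many established sockets each worker really holds.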
3. Process List (ps -eaf and stat)
The process list revealed four total worker processes, with two running much longer than the others, indicating a configuration reload event occurred:
# Process list output subset
PID USER TIME COMMAND
51 apisix 1d06 {openresty} nginx: worker process # Long running (~1 day CPU time)
52 apisix 1d06 {openresty} nginx: worker process # Long running (~1 day CPU time)
114 apisix 33:48 {openresty} nginx: worker process # Short running (~34 minutes CPU time)
116 apisix 23:32 {openresty} nginx: worker process # Short running (~24 minutes CPU time)
# Timestamp of new worker process 114 start time
stat /proc/114
Change: 2025-11-23 09:38:44.660489426 +0000
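One additional check that may help here: nginx retitles a worker that is exiting gracefully after a reload to "nginx: worker process is shutting down". Tallying worker titles therefore shows whether any worker is in the draining state. A minimal sketch, with worker_titles as a hypothetical helper name, assuming a ps that prints the full process title as in the listing above:

```shell
#!/bin/sh
# worker_titles: read process-list lines on stdin and count nginx workers by
# title. The [n] bracket keeps this awk process itself from matching when the
# input comes from a live `ps` listing. (Hypothetical helper name.)
worker_titles() {
    awk '
        /[n]ginx: worker process is shutting down/ { draining++; next }
        /[n]ginx: worker process/                  { serving++ }
        END { printf "serving=%d draining=%d\n", serving, draining }
    '
}

# Example invocation inside the pod (not run here):
#   ps -eaf | worker_titles
```

Fed the four worker lines from the listing above, this reports serving=4 draining=0, since none of the titles carries the "is shutting down" suffix.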
4. System Configuration
Configured Limit: worker_connections 10620;
Total Expected Max Connections (Per Pod): 4 workers * 10620 connections/worker = 42,480 total connections.
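The capacity figure can be reproduced with shell arithmetic. This sketch simply restates the numbers above; the variable names are illustrative. Note that, per the nginx documentation, worker_connections covers all connections a worker opens, including those to upstream servers, so the ceiling for client connections alone may be lower.

```shell
#!/bin/sh
# Per-pod connection ceiling from the configuration quoted above.
workers=4                  # worker processes seen in the process list
worker_connections=10620   # configured worker_connections
max_connections=$((workers * worker_connections))
echo "max connections per pod: $max_connections"
# prints: max connections per pod: 42480
```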
Analysis of Observations and Inconsistencies
We have identified two primary inconsistencies based on this data:
Inconsistency 1: Metric Count vs. OS Count
The Nginx metric reports 58,193 active connections.
The operating system (lsof) reports 39,385 open TCP sockets (file descriptors).
The lsof count is the ground truth for actual resource consumption. The Nginx metric overstates the active connection count by nearly 19,000 connections (roughly the number of active connections two worker processes might hold).
Inconsistency 2: Active Connections vs. Configured Limit
The OS-level count of ~39k is actually within the configured maximum of 42,480. However, the Nginx metric reports ~58k, suggesting the system is vastly exceeding its limits, when in reality, it is not.
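The two inconsistencies can be put into numbers. A minimal sketch using only the figures reported in this issue; the variable names are illustrative:

```shell
#!/bin/sh
# Figures reported in this issue.
metric_active=58193   # apisix_nginx_http_current_connections{state="active"}
lsof_count=39385      # lsof line count across all nginx PIDs
limit=42480           # 4 workers * 10620 worker_connections

excess=$((metric_active - lsof_count))
over_limit=$((metric_active - limit))
headroom=$((limit - lsof_count))

echo "metric exceeds lsof count by: $excess"      # 18808
echo "metric exceeds configured limit by: $over_limit"   # 15713
echo "lsof count is under the limit by: $headroom"       # 3095
```

So the metric sits about 15,700 above what the configuration allows, while the OS-level count leaves about 3,100 connections of headroom.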
Conclusion on the Discrepancy
The most logical explanation for these observations is a bug in how the APISIX Prometheus metrics module aggregates statistics across different generations of Nginx worker processes within the same pod.
When Nginx reloads gracefully (starting PIDs 114 and 116 while PIDs 51 and 52 drain), the metrics module seems to be incorrectly summing the statistics of both the old, draining workers and the new, active workers simultaneously, resulting in a misleadingly high "active" count that doesn't reflect the actual consumed file descriptors shown by lsof.
The true number of active connections for this pod is closer to 39,385.
Expected Behavior
The active connections metric should reflect the connections the pod actually holds, which for this pod is closer to 39,385.
Error Logs
No response
Steps to Reproduce
Observed in production deployment
Environment
- APISIX version (run apisix version): 3.14.1
- Operating system (run uname -a): Linux data-plane-7c4b955754-dp552 5.10.230-223.885.amzn2.x86_64 #1 SMP Tue Dec 3 14:36:00 UTC 2024 x86_64 GNU/Linux
- OpenResty / Nginx version (run openresty -V or nginx -V):
  nginx version: openresty/1.27.1.2 (x86_64-pc-linux-gnu)
  built by gcc 15.2.0 (Wolfi 15.2.0-r3)
  built with OpenSSL 3.6.0 1 Oct 2025
  TLS SNI support enabled
- etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
- APISIX Dashboard version, if relevant:
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run luarocks --version):