
bug: Discrepancy Between lsof Count and apisix_nginx_http_current_connections{state="active"} Metric #12767

@hebbaa

Current Behavior

Issue Description

We are observing a significant discrepancy within a single Apache APISIX pod: the aggregated Prometheus metric for active connections (apisix_nginx_http_current_connections{state="active"}) reports a much higher number than the count of open TCP sockets reported by the operating system's lsof command across all Nginx processes in that pod.
The over-reporting appears to be related to Nginx worker process recycling, suggesting a flaw in how the nginx-lua-prometheus module or the APISIX aggregation layer handles statistics from older, draining worker processes.

The investigation is based on the following specific data observed within a single APISIX Kubernetes pod:

1. System Metrics via curl (Prometheus Endpoint)
The endpoint reported these values for current connections:

apisix_nginx_http_current_connections{state="active"} 58193
apisix_nginx_http_current_connections{state="accepted"} 234773641
apisix_nginx_http_current_connections{state="handled"} 234773641
apisix_nginx_http_current_connections{state="waiting"} 58076
apisix_nginx_http_current_connections{state="writing"} 89
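
For reference, these values can be pulled straight from the plugin's export endpoint. A minimal sketch, assuming the prometheus plugin's default export address and URI (127.0.0.1:9091 and /apisix/prometheus/metrics; adjust if your deployment overrides them):

# Dump only the connection-state gauges from the metrics endpoint
curl -s http://127.0.0.1:9091/apisix/prometheus/metrics \
  | grep '^apisix_nginx_http_current_connections'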

2. Operating System File Descriptor Count (lsof)
Running lsof across all Nginx PIDs within that same pod returned a significantly lower number:

lsof -iTCP -a -p $(pgrep nginx | tr '\n' ',') | wc -l
39385
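
Note that the pgrep | tr pipeline leaves a trailing comma in the PID list and that wc -l also counts lsof's header row. A slightly stricter variant with the same intent (the count should differ by at most a couple of lines):

# Comma-join PIDs without a trailing comma and skip the lsof header line
lsof -nP -iTCP -a -p "$(pgrep nginx | paste -sd, -)" | tail -n +2 | wc -l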

3. Process List (ps -eaf and stat)
The process list revealed four total worker processes, with two running much longer than the others, indicating a configuration reload event occurred:

# Process list output subset
PID   USER     TIME  COMMAND
   51 apisix    1d06 {openresty} nginx: worker process  # Long running (~1 day CPU time)
   52 apisix    1d06 {openresty} nginx: worker process  # Long running (~1 day CPU time)
  114 apisix   33:48 {openresty} nginx: worker process  # Short running (~34 minutes CPU time)
  116 apisix   23:32 {openresty} nginx: worker process  # Short running (~24 minutes CPU time)

# Timestamp of new worker process 114 start time
stat /proc/114
Change: 2025-11-23 09:38:44.660489426 +0000 
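
To see how the sockets split across the old and new worker generations, the same count can be taken per PID. A minimal sketch:

# TCP socket count per Nginx worker, to compare old vs. new generations
for pid in $(pgrep nginx); do
  printf '%s %s\n' "$pid" "$(lsof -nP -iTCP -a -p "$pid" 2>/dev/null | tail -n +2 | wc -l)"
done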

4. System Configuration
Configured Limit: worker_connections 10620;
Total Expected Max Connections (Per Pod): 4 workers * 10620 connections/worker = 42,480 total connections.
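
The effective limits can be double-checked against the running configuration. A sketch, assuming the openresty binary is on the PATH inside the pod (plain nginx -T works the same way):

# Dump the loaded configuration and pull the worker limits
openresty -T 2>/dev/null | grep -E 'worker_(processes|connections)'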

Analysis of Observations and Inconsistencies

We have identified two primary inconsistencies in this data:

Inconsistency 1: Metric Count vs. OS Count
The Nginx metric reports ~58,193 active connections.
The operating system (lsof) reports ~39,385 open TCP sockets (file descriptors).
The lsof count is the ground truth for actual resource consumption, so the Nginx metric overstates the active connection count by 18,808 connections (roughly the number of active connections two worker processes might hold).

Inconsistency 2: Active Connections vs. Configured Limit
The OS-level count of ~39k is within the configured maximum of 42,480. The Nginx metric of ~58k, however, suggests the pod is vastly exceeding its limit, when in reality it is not.

Conclusion on the Discrepancy
The most plausible explanation for these observations is a bug in how the APISIX Prometheus metrics module aggregates statistics across different generations of Nginx worker processes within the same pod.
When Nginx reloads gracefully (starting PIDs 114 and 116 while PIDs 51 and 52 drain), the metrics module appears to sum the statistics of both the old, draining workers and the new, active workers simultaneously, producing a misleadingly high "active" count that does not match the file descriptors actually consumed, as shown by lsof.
The true number of active connections for this pod is closer to 39,385.
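
One way to make the drift visible is to sample the metric and the OS-level socket count side by side and watch the gap jump at reload events. A minimal sketch, reusing the assumed default metrics endpoint from above:

# Log metric vs. OS socket count once a minute
while sleep 60; do
  metric=$(curl -s http://127.0.0.1:9091/apisix/prometheus/metrics \
    | grep -F 'apisix_nginx_http_current_connections{state="active"}' \
    | awk '{print $2}')
  sockets=$(lsof -nP -iTCP -a -p "$(pgrep nginx | paste -sd, -)" | tail -n +2 | wc -l)
  echo "$(date -u +%FT%TZ) metric=$metric sockets=$sockets"
done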

Expected Behavior

The active connection metric should reflect only the live worker generation and match the OS-level socket count; for this pod that is closer to 39,385, not 58,193.

Error Logs

No response

Steps to Reproduce

Observed in a production deployment.
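
A hypothetical minimal reproduction, based on the analysis above, would be to hold long-lived connections open, trigger a graceful reload so a new worker generation starts while the old one drains, and re-read the metric. The pid-file path below is an assumption (a common APISIX default); adjust it for your install:

# Read the metric, force a graceful reload, then read it again
curl -s http://127.0.0.1:9091/apisix/prometheus/metrics | grep -F 'state="active"'
kill -HUP "$(cat /usr/local/apisix/logs/nginx.pid)"   # assumed pid-file location
sleep 5
curl -s http://127.0.0.1:9091/apisix/prometheus/metrics | grep -F 'state="active"'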

Environment

  • APISIX version (run apisix version): 3.14.1

  • Operating system (run uname -a): Linux data-plane-7c4b955754-dp552 5.10.230-223.885.amzn2.x86_64 #1 SMP Tue Dec 3 14:36:00 UTC 2024 x86_64 GNU/Linux

  • OpenResty / Nginx version (run openresty -V or nginx -V): nginx version: openresty/1.27.1.2 (x86_64-pc-linux-gnu)
    built by gcc 15.2.0 (Wolfi 15.2.0-r3)
    built with OpenSSL 3.6.0 1 Oct 2025
    TLS SNI support enabled

  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):

  • APISIX Dashboard version, if relevant:

  • Plugin runner version, for issues related to plugin runners:

  • LuaRocks version, for installation issues (run luarocks --version):
