Collector issue regarding FDs not being closed when under load #77

@rrhubenov

What is the problem in detail?
In certain clusters where a higher volume of logs is generated (e.g. some seeds), the OpenTelemetry Collector instance that acts as the log shipper on the Shoot nodes appears to enter a deadlock-like state in its goroutines.

As a result, it stops responding to Prometheus scrapes. That is only the most visible symptom, because it triggers alerts.
The bigger issue is that the collector starts failing to send logs to the control plane.

The root cause is still unknown, but the following patterns have been identified:

The issue only happens on clusters that appear to be larger (and therefore produce a higher volume of logs).
When the issue begins, the collector stops closing the FDs of files that have already been deleted (e.g. during log rotation):

# lsof -p 1510580   | awk 'NR < 2 || /deleted/'
 COMMAND       PID USER   FD      TYPE             DEVICE  SIZE/OFF      NODE NAME
 opentelem 1510580 root    7r      REG              259,3 107330861   1180216 /var/log/pods/kube-system_calico-node-2c642_2ba9fe21-cbf8-477f-8aaf-6e615ec2f8e8/calico-node/0.log.20260305-174620 (deleted)
 opentelem 1510580 root   16r      REG              259,3     22246   1312585 /var/log/pods/kube-system_egress-filter-applier-54cww_03314a85-89d6-4393-89fc-20a4164b8d97/egress-filter-applier/0.log (deleted)

A large number of connections accumulate in the CLOSE_WAIT state, one for each scrape that Prometheus attempts. This is easy to see with a tool that lists open connections (e.g. ss).
Manually calling curl -v -X GET http://127.0.0.1:18888/metrics while SSH-ed into an affected node causes the request to hang and another CLOSE_WAIT connection to appear.
Closing the connections with ss --tcp state CLOSE-WAIT --kill does not resolve the issue.
Restarting the collector fixes the issue (at least temporarily).
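
To make the two symptoms above easier to spot across many nodes, the manual lsof and ss checks could be approximated with a small script. The following is only a sketch, assuming a Linux node, sufficient privileges to read /proc/<pid>/fd, and a collector that shares the network namespace of the process reading /proc/net/tcp. All function names are hypothetical and not part of the collector.

```python
import os

DELETED_SUFFIX = " (deleted)"  # marker the kernel appends to unlinked files
CLOSE_WAIT = "08"              # TCP state code used in /proc/net/tcp


def deleted_targets(targets):
    """Return the link targets that are marked as deleted."""
    return [t for t in targets if t.endswith(DELETED_SUFFIX)]


def open_fd_targets(pid):
    """Resolve every symlink in /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            continue  # the fd was closed between listdir and readlink
    return targets


def count_close_wait(proc_net_tcp_text, local_port):
    """Count CLOSE_WAIT sockets on local_port in /proc/net/tcp-style text."""
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is "<hex address>:<hex port>", fields[3] is the state
        port_hex = fields[1].rsplit(":", 1)[1]
        if int(port_hex, 16) == local_port and fields[3] == CLOSE_WAIT:
            count += 1
    return count
```

On an affected node, deleted_targets(open_fd_targets(1510580)) would be expected to list the rotated log files shown in the lsof output, and count_close_wait(open("/proc/net/tcp").read(), 18888) would be expected to grow with each scrape. If the endpoint is reached over IPv6, /proc/net/tcp6 would need the same treatment.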
Unfortunately, we have not yet managed to reproduce the issue manually, which makes debugging and fixing it harder.
One possible temporary workaround is to build a check into the collector's systemd units that calls the /metrics endpoint and restarts the unit if the request times out.
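
A minimal sketch of such a check is shown below. It assumes the probe runs periodically (for example from a systemd timer) with permission to restart the unit. The unit name otel-collector.service and all function names are placeholders, not the names used in the actual deployment; only the URL is taken from the curl command above.

```python
import http.client
import subprocess
import urllib.request

METRICS_URL = "http://127.0.0.1:18888/metrics"  # endpoint quoted in this issue
UNIT_NAME = "otel-collector.service"            # hypothetical unit name


def is_healthy(url=METRICS_URL, timeout=5.0):
    """Return True if the endpoint answers with HTTP 200 in time.

    The timeout applies to each socket operation, so a hung collector
    that accepts the connection but never replies is reported unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, socket timeouts and HTTP error statuses
        return False


def restart_unit(unit=UNIT_NAME):
    """Restart the systemd unit; raises if systemctl fails."""
    subprocess.run(["systemctl", "restart", unit], check=True)


def check_and_restart(probe=is_healthy, restart=restart_unit):
    """Run one probe and restart on failure.

    Returns True if a restart was triggered, False otherwise.
    """
    if probe():
        return False
    restart()
    return True
```

Calling check_and_restart() once per timer tick would be enough for a first version. Requiring several consecutive failures before restarting might be worth considering, so that a single slow scrape under load does not bounce the collector.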
