fluent-bit randomly stops reading data on our machines. Once it runs into this state, it stays stuck forever, and restarting the process doesn't help: the new process gets stuck shortly after starting as well.
The pattern we've observed is that right before it gets stuck, there is always a run of consecutive task-creation logs like the following:
[2024/05/10 06:03:24] [trace] [task 0x7f7717e397a0] created (id=83)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e397a0 id=83 OK
[2024/05/10 06:03:24] [trace] [task 0x7f7717e39840] created (id=84)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e39840 id=84 OK
[2024/05/10 06:03:24] [trace] [task 0x7f7717e39a20] created (id=85)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f7717e39a20 id=85 OK
[2024/05/10 06:03:24] [trace] [task 0x7f770a033240] created (id=86)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f770a033240 id=86 OK
[2024/05/10 06:03:24] [trace] [task 0x7f770a0332e0] created (id=301)
[2024/05/10 06:03:24] [debug] [task] created task=0x7f770a0332e0 id=301 OK
...
After sending SIGCONT to fluent-bit, the internals dump showed that the memory limit had not been reached, but every chunk of the tail input was busy. Every dump requested after the hang produced an identical report.
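For reference, the dump above can be requested at any time by signaling the process; a minimal sketch (the `request_dump` helper name is ours, and it assumes you already know the PID of the local fluent-bit process):

```python
import os
import signal

def request_dump(pid: int) -> None:
    """Send SIGCONT to a running fluent-bit process.

    fluent-bit reacts to SIGCONT by printing its "dump internals" report
    (input chunk status, storage-layer info) to its own log output;
    nothing is returned to the caller.
    """
    os.kill(pid, signal.SIGCONT)
```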
The only working fix is to delete the SQLite DB files (.db, .db-shm, and .db-wal) and then restart fluent-bit.
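Before wiping the DB files, it may be worth inspecting what the tail input has recorded in them. A minimal sketch, assuming in_tail's SQLite schema (recent versions keep state in a table named `in_tail_files` with `name`, `offset`, `inode`, and `rotated` columns; adjust if your version's schema differs — the `stale_entries` helper name is ours):

```python
import os
import sqlite3

def stale_entries(db_path: str):
    """Return rows from fluent-bit's in_tail DB that look suspicious:
    the tracked file is gone, or the saved offset is past its size."""
    # Open read-only so we don't disturb the -wal/-shm files of a live process.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    suspicious = []
    for name, offset, inode, rotated in conn.execute(
        "SELECT name, offset, inode, rotated FROM in_tail_files"
    ):
        if not os.path.exists(name):
            suspicious.append((name, offset, inode, rotated, "file missing"))
        elif offset > os.path.getsize(name):
            suspicious.append((name, offset, inode, rotated, "offset past EOF"))
    conn.close()
    return suspicious
```

Entries pointing at deleted files or offsets beyond end-of-file would at least narrow down what state the deleted DB was carrying.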
We first hit this issue on 1.9.9; upgrading to 3.0.3 didn't help.
The relevant fluent-bit configuration snippet is as follows:
[SERVICE]
Flush 1
Daemon Off
Log_Level trace
Parsers_File parsers.conf
storage.path /var/log/flb-storage/
storage.sync normal
storage.checksum off
storage.backlog.mem_limit 1000MB
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_PORT 2020
Hot_Reload On
[INPUT]
Name tail
Tag_Regex (?<pod_name>[a-z0-9](?:[-a-z0-9.]*[a-z0-9])?(?:\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*)_(?<namespace_name>[^_]+)_(?<container_name>.+)-(?<docker_id>[a-z0-9]{64})\.log$
Tag kube.<namespace_name>@@@@@@<container_name>@@@@@@<pod_name>@@@@@@<docker_id>
Path /var/log/containers/*.log
Parser cri
DB /var/log/flb_kube.db
Mem_Buf_Limit 2048MB
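As a sanity check that the Tag_Regex above actually matches the container log file names, here is a quick Python sketch. Oniguruma's `(?<name>...)` groups are rewritten as Python's `(?P<name>...)`, and the sample path is a made-up example in the usual Kubernetes container-log naming scheme:

```python
import re

# The [INPUT] Tag_Regex, translated to Python re syntax.
TAG_REGEX = re.compile(
    r"(?P<pod_name>[a-z0-9](?:[-a-z0-9.]*[a-z0-9])?"
    r"(?:\.[a-z0-9](?:[-a-z0-9]*[a-z0-9])?)*)"
    r"_(?P<namespace_name>[^_]+)"
    r"_(?P<container_name>.+)"
    r"-(?P<docker_id>[a-z0-9]{64})\.log$"
)

# Hypothetical container log path: <pod>_<namespace>_<container>-<64-hex id>.log
path = (
    "/var/log/containers/coredns-5d78c9869d-abcde_kube-system_coredns-"
    + "0" * 64
    + ".log"
)

m = TAG_REGEX.search(path)
assert m is not None
print(m.group("pod_name"), m.group("namespace_name"), m.group("container_name"))
# → coredns-5d78c9869d-abcde kube-system coredns
```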
Any advice on troubleshooting the issue?
Your Environment
Version used: 3.0.3, 1.9.9
Environment name and version:
fluent-bit 3.0.3 Debian package (from packages.fluentbit.io/debian), installed on the base image docker.io/debian:bullseye and run as a container in a Kubernetes cluster
It's deployed as a DaemonSet in Kubernetes, mounting the host path /var/log
/var/log is on the device holding the host's root partition, not on a network file system.
Operating System and version:
A VM in Alibaba Cloud, running Alibaba Cloud Linux release 3 (Soaring Falcon), Linux tri401 5.10.134-16.1.al8.x86_64 #1 SMP Thu Dec 7 14:11:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
The full trace-level logs are attached: clean-faillog.txt