filebeat (practically) hangs after restart on machine with a lot of logs #16076
I took a look at the sources, trying to refresh my memory about the log entries (the actual logs unfortunately rotated after I reset). IIRC I saw, in huge numbers, snippets like … (again and again), and later it slowly started to also emit …
Uuuuuuppsss. This may be a 7.5.1 problem, not a configuration-change problem. I just restarted one of those filebeats without changing anything, and it looks like I have the problem back. Well, we are after hours, so I can try verifying the debug logs more calmly.
OK, so this is the typical snippet I see repeating ad infinitum: …
The timing here (~0.1 s per such section) is on the optimistic side; later I see it slowing down a bit. Still, even if it kept a rate of 10 files/s, with my 70,000 files on the machine I don't expect this restart to finish soon (that alone is roughly two hours). Nothing seems to be going upstream.
I tried downgrading to 7.4.1 and 7.1.1, and it looks like I still have the problem. Maybe I am unlucky to have particularly many files today (some app crashed a lot and generated many of them), but IIRC I have happened to restart some 7.* version without issues (surely I did some upgrades). But that was with the previous configuration. Going to try bisecting the latter a bit.
OK, I am not sure whether this is the only factor, but it looks like I have particularly many files today (due to crashes of some app), and this is the main reason for the problems, not my config change (which triggered the problem only because I restarted filebeat to apply it) and not the filebeat version. I accept that it takes time to process them, but it is problematic that filebeat doesn't start handling other files, or even other inputs, in such a case (no harvesters start, at least for a very noticeable time). The cost alone also seems somewhat problematic: 0.1-0.2 s per file, with files processed linearly, is a lot. This doesn't seem directly related to filesystem cost; the directory uses ext4 dir_index. Filebeat also has high CPU usage in this phase, and strace shows it repeatedly opening and writing …
I'm running into the same issue on 7.4.0 and 7.5.2. I have about 500 thousand files, each about 1 MB. Filebeat can take days to start outputting events to Logstash after a restart. My registry file is about 20 MB. Some things I've noticed: …
CPU usage by filebeat is also suspiciously high. I'm not sure what the bottleneck would be; potentially it's the CPU on my machine: …
Things I've tried: …
One more thing to add: my log path has two wildcards in it, mainly because I'm parsing a folder of n subfolders of n text files. So: …
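For illustration (the poster's actual path wasn't preserved; this directory layout and extension are hypothetical), a two-wildcard glob in a filebeat input looks like this:

filebeat.inputs:
  - type: log
    paths:
      # hypothetical layout: <log root>\<subfolder>\<file>.txt
      - 'D:\logs\*\*.txt'

Each directory-level wildcard multiplies the number of candidate files filebeat has to glob and stat on every scan.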
Unfortunately, the registry file has its limitations in Filebeat. It is just a plain text file with JSON data in it, so every time the state needs to be updated (e.g. events were ACKed or dropped), the complete registry is flushed to disk. The flush consists of, roughly, serializing every state to JSON, writing it all to a temporary file, syncing it, and renaming it over the old registry file.
This is a slow process and it can be subject to backpressure. There are a few workarounds, sketched in the config below.
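The workaround list itself wasn't captured above; the following is a sketch of the settings usually involved, assuming the 7.x log input (the path and values are illustrative, not recommendations):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/apps/*.log     # illustrative path
    ignore_older: 48h           # don't pick up files not modified recently
    close_inactive: 5m          # free harvesters for idle files sooner
    clean_inactive: 72h         # expire old entries from the registry
    clean_removed: true         # drop registry entries for deleted files
    harvester_limit: 512        # cap concurrent harvesters

Shrinking the registry with the clean_* options directly reduces the amount of JSON that has to be rewritten on every flush.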
Also, next time please post your question on Discuss. GitHub issues are reserved for verified bugs.
Well. You have a fairly simple test case: start filebeat, create 50,000 files in the log directory, call … I'd say the current registry design is buggy, at least in cases where it's possible to have many log files. It's likely that switching from a single JSON file to something smarter that doesn't require rewriting the whole registry after every operation (SQLite for example, or even state batched into many files with, say, 100 entries per file) would resolve the issue. I am going to look at the flush settings.
There is an ongoing effort to rewrite the registry. Hopefully, when it is merged, your situation will improve: #12908
It preliminarily seems that …
mitigates the problem, although I have yet to see the behaviour in the „worst” cases. It seems to me that the current default of 0 is wrong, considering the current costly implementation, and it would be more sensible to default to something else. Or, at the very least, I'd suggest that …
PS #12908 makes me afraid for now („All in all there is a performance degragation due to this change. This becomes most noticable in increased CPU usage.”). I understand this is a temporary step, but maybe during this temporary state the suggestions above make even more sense?
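For reference, the setting under discussion is presumably the registry flush interval, which in 7.x defaults to 0s (the registry is rewritten after every ACKed batch); a minimal sketch:

# rewrite the registry at most once per minute instead of after every batch
filebeat.registry.flush: 60s

The trade-off is that state changes since the last flush can be replayed (producing duplicates) if filebeat crashes before the next flush.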
I'm still seeing this same issue after making the tweaks suggested above. I've opened a topic on Discuss to continue the discussion. Thanks.
I have the same issue, and it isn't fixed until I remove the registry, which is undesirable because it leads to lots of duplicates.
Hi, I have filebeat-7.12.1 on Windows 2016, trying to process 411,336 items matching "D:\Logs\ProjectA-METRICS***.json" (individual *.json files range from 7.5 KB to 280 MB) and writing to an AWS OpenSearch index.

2022-07-18T09:05:33.324+0530 INFO instance/beat.go:660 Home path: [C:\Program Files\Elastic\Beats\7.12.1\filebeat] Config path: [C:\ProgramData\Elastic\Beats\filebeat] Data path: [C:\ProgramData\Elastic\Beats\filebeat\data] Logs path: [C:\ProgramData\Elastic\Beats\filebeat\logs]

I am using the filebeat input section below; I have tried different values and keep getting the panic error again and again.

filebeat.inputs:
…

=== Error ====
goroutine 130 [running]:
…
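The input section itself was cut off above; for illustration only, one plausible shape for that path, where the recursive glob and all values are assumptions rather than the poster's actual config:

filebeat.inputs:
  - type: log
    paths:
      # assumed expansion of "D:\Logs\ProjectA-METRICS***.json"
      - 'D:\Logs\ProjectA-METRICS\**\*.json'
    harvester_limit: 256    # keep 411k candidate files from opening at once
    close_inactive: 2m
    ignore_older: 72h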
Reopening, as there still seem to be issues in Filebeat 8.3: https://discuss.elastic.co/t/fillebeat-keeps-restarting-in-kuberneates/309702
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
I experienced a very similar issue with filebeat v8.8.2 making little to no progress reading and shipping log files. Using … Per @kvch's comment I added filebeat.registry.flush: 60s (see the commented line in the config below), which fixed throughput. For reference, the config:

filebeat.autodiscover:
providers:
- type: kubernetes
node: ${NODE_NAME}
add_resource_metadata:
namespace:
enabled: false
deployment: false
cronjob: false
templates:
- condition:
and:
- regexp:
kubernetes.namespace: "^some-prefix.*"
- not:
equals:
kubernetes.labels.some-label: "some-value"
config:
- type: container
paths:
- /var/log/containers/*${data.kubernetes.container.id}.log
tail_files: true
#filebeat.registry.flush: 60s # adding this fixed throughput
output.elasticsearch:
index: default
hosts: ["http://localhost:9200"]
allow_older_versions: true
setup:
ilm.enabled: false
  template.enabled: false
I'm also experiencing this in 8.13.1 with only ~17k files created hourly. |
Sorry for the somewhat vague problem description. I was working on the problem in panic mode and had no time to preserve details like exact log snippets.
I use filebeat to scrape logs from machines where I have PLENTY of log files. Numbers vary, but having a few thousand new files every day is normal, and something like 50 thousand happens (in general: a lot of small apps running, each app opens a new log on each restart and after each midnight, old logs are moved out by name, log names contain the date…). After some initial tuning (harvester limit adapted to system capacity, time limits tuned to abandon inactive files at a reasonable rate, etc.) filebeat handled those setups without problems.
Well, it handled them until I reconfigured. Today I split some inputs into a few sections; more or less, it was a change from
…
into
…
where the „other settings” were the same everywhere, except that I added some custom fields in each section (this was the reason for the change) and reduced the harvester limits so they summed up to the original value (a sketch of this kind of split follows below).
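Since the exact configs weren't preserved, here is a minimal sketch of that kind of split, with hypothetical paths, field names, and limits:

# before: one input covering all apps
filebeat.inputs:
  - type: log
    paths:
      - /var/log/apps/*/*.log
    harvester_limit: 400

# after: several sections, each with its own custom fields,
# with harvester limits summing to the original value
filebeat.inputs:
  - type: log
    paths:
      - /var/log/apps/group-a/*.log
    fields:
      app_group: a
    harvester_limit: 200
  - type: log
    paths:
      - /var/log/apps/group-b/*.log
    fields:
      app_group: b
    harvester_limit: 200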
After the restart, filebeat practically stopped working. It either didn't forward anything or handled only one section (depending on the machine; I saw the problem on a few). After I enabled debug logging, it only logged myriads of notes about reading and writing data.json with some statistics („XYZV before, XYZV after”), and later started to emit some notes about finding or adding the state of individual files (roughly one such note per second). On the machine I fixed last, it remained in that state for about 2 hours without noticeable change.
I recovered from that by stopping filebeat, erasing /var/lib/filebeat/registry, and starting filebeat again. Then it simply started to work and seems to work properly (of course it republished all old logs according to my age settings, and processing the old data introduced some lag). I suppose there is something wrong in the way state is migrated/interpreted/checked after the input sections change. This is somewhat unexpected; I didn't think that changing the allocation of files to sections impacts state handling.
Tested on filebeat 7.5.1
PS My /var/lib/filebeat/registry/data.json is 3-8 MB, if that is of any interest.