filebeat (practically) hangs after restart on machine with a lot of logs #16076

Open
Mekk opened this issue Feb 4, 2020 · 18 comments

Labels
Filebeat, Team:Elastic-Agent-Data-Plane

Comments

Mekk commented Feb 4, 2020

Sorry for the somewhat vague problem description. I was working on the problem in panic mode and had no time to preserve details like exact log snippets.

I use filebeat to scrape logs from machines where I have PLENTY of log files. Numbers vary, but a few thousand new files every day is normal, and something like 50 thousand happens (in general: a lot of small apps running, each app opens a new log on each restart and after each midnight, old logs are moved out by name, log names contain the date…). After some initial tuning (harvester limit adapted to system capacity, time limits tuned to abandon inactive files at a reasonable rate, etc.) filebeat handles those setups without problems.
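
For illustration, the kind of per-input tuning I mean looks roughly like this (the values here are invented for the example, not my actual config):

- type: log
  enabled: true
  paths:
    - /some/path/*.log
  harvester_limit: 1000   # adapted to what the system can handle
  close_inactive: 5m      # stop harvesting files that stop growing
  ignore_older: 24h       # don't pick up old rotated files again
  clean_inactive: 48h     # drop registry state for files inactive long enough
  clean_removed: true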

Well, it handled them until I reconfigured. Today I split some inputs into a few sections; more or less it was a change from

- type: log
  enabled: true
  paths:
    - /some/path/*.log
    - /other/path/*.log
    - …
  # and other settings

into

- type: log
  enabled: true
  paths:
    - /some/path/*.log
  # and other settings
- type: log
  enabled: true
  paths:
    - /other/path/*.log
  # and other settings
- type: log
  enabled: true
  paths:
    - …remaining …
  # and other settings

where the „other settings” were the same everywhere, except that I added some custom fields in each section (this was the reason for the change) and reduced the harvester limits so they summed up to the original value.
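
For example, one of the split sections now looks more or less like this (the field name and limit are invented for the example):

- type: log
  enabled: true
  paths:
    - /some/path/*.log
  harvester_limit: 300    # reduced so that the per-section limits sum to the old total
  fields:                 # the custom fields that motivated the split
    app_group: some_app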

After the restart, filebeat practically stopped working. It either didn't forward anything, or handled only one section (depending on the machine; I saw the problem on a few). After I enabled debug logging, it logged only myriads of notes about reading and writing data.json with some statistics („XYZV before, XYZV after”), and later started to emit notes about finding or adding state for individual files (~1 such note per second). On the machine I fixed last, it remained in that state for about 2 hours without noticeable change.

I recovered from that by stopping filebeat, erasing /var/lib/filebeat/registry, and starting filebeat again – then it simply started to work and seems to work properly (of course it republished all old logs according to my age settings and introduced lag caused by processing the old data).

I suppose there is something wrong in the way state is migrated/interpreted/checked after the input sections change. This is somewhat unexpected; I didn't think that changing the allocation of files to sections would impact state handling.

Tested on filebeat 7.5.1

PS My /var/lib/filebeat/registry/data.json is 3-8 MB, if that is of any interest.

Mekk commented Feb 4, 2020

I took a look at the sources to refresh my memory about the log entries (the actual logs unfortunately rotated after I reset). IIRC I saw huge numbers of snippets like

… sth about reading registry …
Registrar states cleaned up. Before: 3176, After: 3716, Pending: 0
Writing registry file: …

(again and again), and later it slowly started to also emit

New state added for /some/specific/log/name/different/every/time

@kaiyan-sheng added the bug, Filebeat, Team:Beats, and Team:Services (Deprecated) labels on Feb 4, 2020

Mekk commented Feb 4, 2020

Uuuuuuppsss. This may be a 7.5.1 problem, not a configuration-change problem. I just restarted one of those filebeats without changing anything, and it looks like I have the problem back.

Well, we are after hours, so I may try verifying the debug logs more calmly.

Mekk commented Feb 4, 2020

OK, so this is the typical snippet I see repeating ad infinitum:

2020-02-04T20:51:42.595+0100	DEBUG	[input]	file/states.go:68	New state added for /home/ids_backend_api/APP/log/adb_processor_adm_20200202_021116_18336_pqt-ids-deb8.log
2020-02-04T20:51:42.596+0100	DEBUG	[registrar]	registrar/registrar.go:346	Registrar states cleaned up. Before: 32448, After: 32448, Pending: 0
2020-02-04T20:51:42.598+0100	DEBUG	[registrar]	registrar/registrar.go:411	Write registry file: /var/lib/filebeat/registry/filebeat/data.json (32448)
2020-02-04T20:51:42.678+0100	DEBUG	[registrar]	registrar/registrar.go:404	Registry file updated. 32448 states written.
2020-02-04T20:51:42.678+0100	DEBUG	[registrar]	registrar/registrar.go:356	Processing 1 events
2020-02-04T20:51:42.678+0100	DEBUG	[registrar]	registrar/registrar.go:326	Registrar state updates processed. Count: 1
2020-02-04T20:51:42.679+0100	DEBUG	[acker]	beater/acker.go:64	stateful ack	{"count": 1}
2020-02-04T20:51:42.679+0100	DEBUG	[publisher]	pipeline/client.go:220	Pipeline client receives callback 'onFilteredOut' for event: {Timestamp:0001-01-01 00:00:00 +0000 UTC Meta:null Fields:null Private:{Id:1855605-65026 Finished:true Fileinfo:<nil> Source:/home/ids_backend_api/APP/log/adb_processor_adm_20200202_021116_18336_pqt-ids-deb8.log Offset:160240 Timestamp:2020-02-04 20:47:25.96242759 +0100 CET TTL:-1ns Type:log Meta:map[] FileStateOS:1855605-65026} TimeSeries:false}
2020-02-04T20:51:42.679+0100	DEBUG	[input]	file/states.go:68	New state added for /home/ids_backend_api/APP/log/adb_processor_adm_20200203_003539_21527_pqt-ids-deb8.log
2020-02-04T20:51:42.679+0100	DEBUG	[registrar]	registrar/registrar.go:346	Registrar states cleaned up. Before: 32448, After: 32448, Pending: 0
2020-02-04T20:51:42.681+0100	DEBUG	[registrar]	registrar/registrar.go:411	Write registry file: /var/lib/filebeat/registry/filebeat/data.json (32448)
2020-02-04T20:51:42.763+0100	DEBUG	[registrar]	registrar/registrar.go:404	Registry file updated. 32448 states written.
2020-02-04T20:51:42.763+0100	DEBUG	[registrar]	registrar/registrar.go:356	Processing 1 events
2020-02-04T20:51:42.763+0100	DEBUG	[registrar]	registrar/registrar.go:326	Registrar state updates processed. Count: 1
2020-02-04T20:51:42.764+0100	DEBUG	[registrar]	registrar/registrar.go:346	Registrar states cleaned up. Before: 32448, After: 32448, Pending: 0
2020-02-04T20:51:42.767+0100	DEBUG	[acker]	beater/acker.go:64	stateful ack	{"count": 1}
2020-02-04T20:51:42.772+0100	DEBUG	[registrar]	registrar/registrar.go:411	Write registry file: /var/lib/filebeat/registry/filebeat/data.json (32448)
2020-02-04T20:51:42.772+0100	DEBUG	[publisher]	pipeline/client.go:220	Pipeline client receives callback 'onFilteredOut' for event: {Timestamp:0001-01-01 00:00:00 +0000 UTC Meta:null Fields:null Private:{Id:1894543-65026 Finished:true Fileinfo:<nil> Source:/home/ids_backend_api/APP/log/adb_processor_adm_20200203_003539_21527_pqt-ids-deb8.log Offset:161959 Timestamp:2020-02-04 20:47:26.0981662 +0100 CET TTL:-1ns Type:log Meta:map[] FileStateOS:1894543-65026} TimeSeries:false}

The timing here (~0.1 s per such section) is on the optimistic side; later I see it slowing down a bit. Still, even if it kept a 10 files/s rate, with my 70,000 files on the machine that is roughly 7,000 seconds, so I don't expect this restart to finish soon.

Nothing seems to be going upstream.

Mekk commented Feb 4, 2020

I tried downgrading to 7.4.1 and 7.1.1, and it looks like I still have the problem. Maybe I am unlucky in having particularly many files today (some app crashed a lot and generated many of them), but IIRC I happened to restart some 7.* versions without issues (surely I did some upgrades). But that was with the previous configuration.

Going to try bisecting the latter a bit.

Mekk commented Feb 4, 2020

OK, I am not sure whether this is the only factor, but it looks like I have particularly many files today (due to crashes of some app), and this is the main reason for the problems – not my config change (which triggered the problem only because I restarted filebeat to apply it) and not the filebeat version.

I accept that it takes time to process them, but it is problematic that filebeat doesn't start handling other files, or even other inputs, in such a case (no harvesters start, at least for a very noticeable time).

The cost itself also seems somewhat problematic: 0.1-0.2 s per single file, with files processed linearly, is a lot. This doesn't seem directly related to filesystem cost; the directory uses ext4 dir_index. As filebeat also shows high CPU usage in this phase, and as strace shows it repeatedly opening and writing data.json, it seems to me that the slowdown may be caused by a repeated read/parse/update/serialize/write cycle on data.json. In the case of multiple inputs there can also be some contention for locks on data.json or something similar. So my wild bet is that a huge improvement could be obtained by checkpointing data.json from time to time instead of doing it after every single file, which seems to be what happens.

@Mekk changed the title from "filebeat (practically) hangs after reconfiguration on machine with a lot of logs (lost state??)" to "filebeat (practically) hangs after restart on machine with a lot of logs" on Feb 4, 2020

serant commented Feb 7, 2020

I'm running into the same issue on 7.4.0 and 7.5.2. I have about 500 thousand files, each about 1 MB. Filebeat can take days to start outputting events to Logstash after a restart. My registry file is about 20 MB.

Some things I've noticed:

  • If I delete my registry file, Filebeat will start sending events immediately
  • The amount of time it takes for Filebeat to start sending events seems to grow as the size of my registry file increases. For example, when my registry file was only 200 KB, the restart would take only a few minutes.
  • When looking at usage statistics, the number of open files by the harvester seems to stay at 0 for days (it increases to the number of files I'm ingesting once filebeat is outputting events):
"harvester": {
    "open_files": 0,
    "running": 0
}

CPU usage by filebeat is also suspiciously high. I'm not sure what the bottleneck would be. Potentially it's the CPU on my machine:

  • Elasticsearch/logstash has tons of headroom (even when filebeat is sending events before the restart is issued)
  • Not hitting RAM limits on the machine (8 GB is still free)
  • Filebeat is using 100% of the CPU on one of the cores of the machine
  • Disk IO is minimal (about 11% usage)

Things I've Tried:

  • Purging my filebeat registry file (filebeat starts sending events quickly afterwards but then I have to re-send everything)
  • Tuning my filebeat inputs. This should only leave about 100 files being active at once.
- type: log
  harvester_limit: 512

  enabled: true
  scan_frequency: 120
  close_inactive: 1m
  clean_inactive: 750h
  clean_removed: true

serant commented Feb 7, 2020

One more thing to add: my log path has two wildcards in it, mainly because I'm parsing a folder of n subfolders of text files. So:

paths:
  - /path_to_subfolders/*/*.txt

kvch commented Feb 11, 2020

Unfortunately, the registry file has its limitations in Filebeat. It is just a plain text file with JSON data in it, so every time the state needs to be updated (e.g. events were ACKed or dropped), the complete registry is flushed to disk. The flush consists of the following steps:

  • create a new temporary file
  • serialize millions of entries to JSON
  • fsync the file
  • close the FD
  • move the file
  • fsync the folder where the registry is located

This is a slow process and it can be subject to backpressure.

There are a few workarounds; the main one is to set filebeat.registry.flush to a non-zero interval, so the registry is written out periodically instead of after every update.
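
For reference, the setting goes at the top level of filebeat.yml; treat the 60s below as an example value only:

# Flush the registry to disk at most once per interval instead of on every update.
# An unclean shutdown can lose up to this much registry progress, so some events
# may be re-sent after a restart.
filebeat.registry.flush: 60s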

Also, next time please post your question on Discuss. GitHub issues are reserved for verified bugs.

Mekk commented Feb 13, 2020

Well, you have a fairly simple test case: start filebeat, create 50,000 files in the log directory, and call filebeat restart. After the restart, filebeat will spend hours at 100% CPU before it starts actually doing anything.

I'd say the current registry design is buggy, at least in cases where it's possible to have many log files. It's likely that switching from a single JSON file to something smarter that wouldn't require rewriting the whole registry after every operation (SQLite for example, or even state batched into many files with, say, 100 entries per file) would resolve the issue.

I am going to look at flush settings.

kvch commented Feb 13, 2020

There is an ongoing effort to rewrite the registry. Hopefully, when it is merged, your situation will improve: #12908

Mekk commented Feb 14, 2020

It preliminarily seems that

filebeat.registry.flush: 60s

mitigates the problem, although I have yet to see the behaviour in the „worst” cases.

It seems to me that the current default of 0 is, considering the current costly implementation, wrong, and that it would be more sensible to default to something else.

Or, at the very least, I'd suggest that the /etc/filebeat/filebeat.yml shipped with packaged filebeat contain this setting (even if commented out), with some remark about its big performance impact.
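
Something along these lines in the packaged config would do (the wording is just a suggestion):

# By default the registry is rewritten after every state update. On hosts tracking
# tens of thousands of files this dominates startup time and CPU usage. Uncommenting
# this batches registry writes; an unclean shutdown may then cause some events to be
# sent again.
#filebeat.registry.flush: 60s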

PS #12908 makes me a bit afraid for now („All in all there is a performance degragation due to this change. This becomes most noticable in increased CPU usage.”). I understand this is a temporary step, but maybe during this temporary state the suggestions above make even more sense?

@andresrc added the Team:Integrations label on Mar 6, 2020

serant commented Jul 10, 2020

I'm still seeing the same issue after making the tweaks suggested above. I've opened a topic on Discuss to continue the discussion:
https://discuss.elastic.co/t/filebeat-extremely-slow-startup-for-large-number-of-files/240746

Thanks

@SudoNova

I have the same issue, and it doesn't get fixed until I remove the registry, which is undesirable because it leads to lots of duplication.
A small portion of the log file follows:

2021-01-31T05:03:36.641+0330	INFO	instance/beat.go:455	filebeat start running.
2021-01-31T05:03:36.641+0330	INFO	[monitoring]	log/log.go:118	Starting metrics logging every 30s
2021-01-31T05:03:36.641+0330	DEBUG	[service]	service/service_windows.go:93	Windows is interactive: true
2021-01-31T05:03:36.641+0330	DEBUG	[test]	registrar/migrate.go:304	isFile(C:\ProgramData\Elastic\Beats\filebeat\data\registry) -> false
2021-01-31T05:03:36.641+0330	DEBUG	[test]	registrar/migrate.go:304	isFile() -> false
2021-01-31T05:03:36.641+0330	DEBUG	[test]	registrar/migrate.go:297	isDir(C:\ProgramData\Elastic\Beats\filebeat\data\registry\filebeat) -> true
2021-01-31T05:03:36.641+0330	DEBUG	[test]	registrar/migrate.go:304	isFile(C:\ProgramData\Elastic\Beats\filebeat\data\registry\filebeat\meta.json) -> true
2021-01-31T05:03:36.642+0330	DEBUG	[registrar]	registrar/migrate.go:84	Registry type '1' found
2021-01-31T05:03:37.251+0330	INFO	memlog/store.go:119	Loading data file of 'C:\ProgramData\Elastic\Beats\filebeat\data\registry\filebeat' succeeded. Active transaction id=862418635
2021-01-31T05:03:38.325+0330	INFO	memlog/store.go:124	Finished loading transaction log file for 'C:\ProgramData\Elastic\Beats\filebeat\data\registry\filebeat'. Active transaction id=862438999
2021-01-31T05:03:38.326+0330	WARN	beater/filebeat.go:381	Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2021-01-31T05:03:38.643+0330	INFO	[registrar]	registrar/registrar.go:109	States Loaded from registrar: 16798
2021-01-31T05:03:38.644+0330	INFO	[crawler]	beater/crawler.go:71	Loading Inputs: 10
2021-01-31T05:03:38.644+0330	DEBUG	[registrar]	registrar/registrar.go:140	Starting Registrar
2021-01-31T05:03:38.655+0330	WARN	[cfgwarn]	log/config.go:184	EXPERIMENTAL: scan_sort is used.
2021-01-31T05:03:38.655+0330	DEBUG	[input]	log/config.go:207	recursive glob enabled
2021-01-31T05:03:38.656+0330	WARN	[cfgwarn]	log/config.go:184	EXPERIMENTAL: scan_sort is used.
2021-01-31T05:03:38.656+0330	DEBUG	[input]	log/input.go:169	exclude_files: []. Number of states: 16798
2021-01-31T05:03:38.656+0330	DEBUG	[input]	file/states.go:68	New state added for C:\Logs\FileService\2021-01-26\13__.txt
2021-01-31T05:03:38.656+0330	DEBUG	[acker]	beater/acker.go:59	stateful ack	{"count": 1}
2021-01-31T05:03:38.656+0330	DEBUG	[registrar]	registrar/registrar.go:264	Processing 1 events
2021-01-31T05:03:38.657+0330	DEBUG	[registrar]	registrar/registrar.go:231	Registrar state updates processed. Count: 1
2021-01-31T05:03:38.656+0330	DEBUG	[publisher]	pipeline/client.go:231	Pipeline client receives callback 'onFilteredOut' for event: {Timestamp:0001-01-01 00:00:00 +0000 UTC Meta:null Fields:null Private:{Id:native::4849664-152171-1524026271 PrevId: Finished:true Fileinfo:<nil> Source:C:\Logs\FileService\2021-01-26\13__.txt Offset:29948 Timestamp:2021-01-31 04:57:20.5605686 +0330 +0330 TTL:241h0m0s Type:log Meta:map[] FileStateOS:4849664-152171-1524026271 IdentifierName:native} TimeSeries:false}
2021-01-31T05:03:38.657+0330	DEBUG	[input]	file/states.go:68	New state added for C:\Logs\FileService\2021-01-27\18__.txt
2021-01-31T05:03:38.657+0330	DEBUG	[acker]	beater/acker.go:59	stateful ack	{"count": 1}
2021-01-31T05:03:38.657+0330	DEBUG	[publisher]	pipeline/client.go:231	Pipeline client receives callback 'onFilteredOut' for event: {Timestamp:0001-01-01 00:00:00 +0000 UTC Meta:null Fields:null Private:{Id:native::5111808-156889-1524026271 PrevId: Finished:true Fileinfo:<nil> Source:C:\Logs\FileService\2021-01-27\18__.txt Offset:18932 Timestamp:2021-01-31 04:57:18.0947681 +0330 +0330 TTL:241h0m0s Type:log Meta:map[] FileStateOS:5111808-156889-1524026271 IdentifierName:native} TimeSeries:false}
2021-01-31T05:03:38.657+0330	DEBUG	[registrar]	registrar/registrar.go:254	Registrar states cleaned up. Before: 16798, After: 16798, Pending: 1
2021-01-31T05:03:38.658+0330	DEBUG	[input]	file/states.go:68	New state added for C:\Logs\FileService\2021-01-23\15__.txt
2021-01-31T05:03:38.658+0330	DEBUG	[acker]	beater/acker.go:59	stateful ack	{"count": 1}
2021-01-31T05:03:38.666+0330	DEBUG	[registrar]	registrar/registrar.go:201	Registry file updated. 16798 active states.
2021-01-31T05:03:39.981+0330	DEBUG	[registrar]	registrar/registrar.go:264	Processing 1 events
2021-01-31T05:03:39.981+0330	DEBUG	[registrar]	registrar/registrar.go:231	Registrar state updates processed. Count: 1
2021-01-31T05:03:39.981+0330	DEBUG	[publisher]	pipeline/client.go:231	Pipeline client receives callback 'onFilteredOut' for event: {Timestamp:0001-01-01 00:00:00 +0000 UTC Meta:null Fields:null Private:{Id:native::13172736-514469-1524026271 PrevId: Finished:true Fileinfo:<nil> Source:C:\Logs\FileService\2021-01-23\15__.txt Offset:18832 Timestamp:2021-01-31 04:57:07.2302427 +0330 +0330 TTL:241h0m0s Type:log Meta:map[] FileStateOS:13172736-514469-1524026271 IdentifierName:native} TimeSeries:false}
2021-01-31T05:03:39.981+0330	DEBUG	[registrar]	registrar/registrar.go:254	Registrar states cleaned up. Before: 16798, After: 16798, Pending: 2
2021-01-31T05:03:39.983+0330	DEBUG	[registrar]	registrar/registrar.go:201	Registry file updated. 16798 active states.
2021-01-31T05:03:39.983+0330	DEBUG	[input]	file/states.go:68	New state added for C:\Logs\FileService\2021-01-24\13__.txt
2021-01-31T05:03:39.983+0330	DEBUG	[acker]	beater/acker.go:59	stateful ack	{"count": 1}
2021-01-31T05:03:40.852+0330	DEBUG	[registrar]	registrar/registrar.go:264	Processing 1 events
2021-01-31T05:03:40.852+0330	DEBUG	[registrar]	registrar/registrar.go:231	Registrar state updates processed. Count: 1
2021-01-31T05:03:40.853+0330	DEBUG	[registrar]	registrar/registrar.go:254	Registrar states cleaned up. Before: 16798, After: 16798, Pending: 3
2021-01-31T05:03:40.858+0330	DEBUG	[publisher]	pipeline/client.go:231	Pipeline client receives callback 'onFilteredOut' for event: {Timestamp:0001-01-01 00:00:00 +0000 UTC Meta:null Fields:null Private:{Id:native::1966080-516576-1524026271 PrevId: Finished:true Fileinfo:<nil> Source:C:\Logs\FileService\2021-01-24\13__.txt Offset:18732 Timestamp:2021-01-31 04:57:04.8268518 +0330 +0330 TTL:241h0m0s Type:log Meta:map[] FileStateOS:1966080-516576-1524026271 IdentifierName:native} TimeSeries:false}
2021-01-31T05:03:40.860+0330	DEBUG	[input]	file/states.go:68	New state added for C:\Logs\FileService\2021-01-22\20__.txt
2021-01-31T05:03:40.860+0330	DEBUG	[acker]	beater/acker.go:59	stateful ack	{"count": 1}
2021-01-31T05:03:40.861+0330	DEBUG	[registrar]	registrar/registrar.go:201	Registry file updated. 16798 active states.
2021-01-31T05:03:42.192+0330	DEBUG	[registrar]	registrar/registrar.go:264	Processing 1 events
2021-01-31T05:03:42.192+0330	DEBUG	[registrar]	registrar/registrar.go:231	Registrar state updates processed. Count: 1
2021-01-31T05:03:42.193+0330	DEBUG	[registrar]	registrar/registrar.go:254	Registrar states cleaned up. Before: 16798, After: 16798, Pending: 4
2021-01-31T05:03:42.194+0330	DEBUG	[registrar]	registrar/registrar.go:201	Registry file updated. 16798 active states.
2021-01-31T05:03:42.424+0330	DEBUG	[publisher]	pipeline/client.go:231	Pipeline client receives callback 'onFilteredOut' for event: {Timestamp:0001-01-01 00:00:00 +0000 UTC Meta:null Fields:null Private:{Id:native::18350080-512223-1524026271 PrevId: Finished:true Fileinfo:<nil> Source:C:\Logs\FileService\2021-01-22\20__.txt Offset:21908 Timestamp:2021-01-31 04:56:49.6360456 +0330 +0330 TTL:241h0m0s Type:log Meta:map[] FileStateOS:18350080-512223-1524026271 IdentifierName:native} TimeSeries:false}
2021-01-31T05:03:42.430+0330	DEBUG	[input]	file/states.go:68	New state added for C:\Logs\FileService\2021-01-26\15__.txt
2021-01-31T05:03:42.430+0330	DEBUG	[acker]	beater/acker.go:59	stateful ack	{"count": 1}
2021-01-31T05:03:43.634+0330	DEBUG	[registrar]	registrar/registrar.go:264	Processing 1 events
2021-01-31T05:03:43.634+0330	DEBUG	[registrar]	registrar/registrar.go:231	Registrar state updates processed. Count: 1
2021-01-31T05:03:43.634+0330	DEBUG	[registrar]	registrar/registrar.go:254	Registrar states cleaned up. Before: 16798, After: 16798, Pending: 5
2021-01-31T05:03:43.650+0330	DEBUG	[registrar]	registrar/registrar.go:201	Registry file updated. 16798 active states.
2021-01-31T05:03:43.691+0330	DEBUG	[publisher]	pipeline/client.go:231	Pipeline client receives callback 'onFilteredOut' for event: {Timestamp:0001-01-01 00:00:00 +0000 UTC Meta:null Fields:null Private:{Id:native::25886720-152248-1524026271 PrevId: Finished:true Fileinfo:<nil> Source:C:\Logs\FileService\2021-01-26\15__.txt Offset:16864 Timestamp:2021-01-31 04:56:54.2161996 +0330 +0330 TTL:241h0m0s Type:log Meta:map[] FileStateOS:25886720-152248-1524026271 IdentifierName:native} TimeSeries:false}

adapasuresh commented Jul 18, 2022

Hi,

I have filebeat-7.12.1 on Windows 2016, which is trying to process 411,336 items matching "D:\Logs\ProjectA-METRICS***.json" with varied *.json file sizes (max 280 MB, min 7.5 KB), writing to an AWS OpenSearch index.

2022-07-18T09:05:33.324+0530 INFO instance/beat.go:660 Home path: [C:\Program Files\Elastic\Beats\7.12.1\filebeat] Config path: [C:\ProgramData\Elastic\Beats\filebeat] Data path: [C:\ProgramData\Elastic\Beats\filebeat\data] Logs path: [C:\ProgramData\Elastic\Beats\filebeat\logs]
2022-07-18T09:05:33.338+0530 DEBUG [beat] instance/beat.go:712 Beat metadata path: C:\ProgramData\Elastic\Beats\filebeat\data\meta.json
2022-07-18T09:05:33.339+0530 INFO instance/beat.go:668 Beat ID: 5a3fd1ad-71a0-428d-9341-9b93aa45eb4e
2022-07-18T09:05:33.339+0530 DEBUG [seccomp] seccomp/seccomp.go:96 Syscall filtering is only supported on Linux
2022-07-18T09:05:33.339+0530 INFO [beat] instance/beat.go:996 Beat info {"system_info": {"beat": {"path": {"config": "C:\ProgramData\Elastic\Beats\filebeat", "data": "C:\ProgramData\Elastic\Beats\filebeat\data", "home": "C:\Program Files\Elastic\Beats\7.12.1\filebeat", "logs": "C:\ProgramData\Elastic\Beats\filebeat\logs"}, "type": "filebeat", "uuid": "5a3fd1ad-71a0-428d-9341-9b93aa45eb4e"}}}
2022-07-18T09:05:33.340+0530 INFO [beat] instance/beat.go:1005 Build info {"system_info": {"build": {"commit": "651a2ad1225f3d4420a22eba847de385b71f711d", "libbeat": "7.12.1", "time": "2021-04-20T19:58:27.000Z", "version": "7.12.1"}}}
2022-07-18T09:05:33.340+0530 INFO [beat] instance/beat.go:1008 Go runtime info {"system_info": {"go": {"os":"windows","arch":"amd64","max_procs":10,"version":"go1.15.9"}}}
2022-07-18T09:05:33.342+0530 INFO [beat] instance/beat.go:1012 Host info {"system_info": {"host": {"architecture":"x86_64","boot_time":"2022-07-06T00:38:55.48+05:30","name":"INTAWPGTUI051","ip":["10.168.148.50/24","::1/128","127.0.0.1/8"],"kernel_version":"10.0.14393.4467 (rs1_release.210604-1844)","mac":["00:50:56:87:ad:d8"],"os":{"type":"windows","family":"windows","platform":"windows","name":"Windows Server 2016 Standard","version":"10.0","major":10,"minor":0,"patch":0,"build":"14393.4467"},"timezone":"IST","timezone_offset_sec":19800,"id":"a4f380e9-055f-4b4c-bdc9-b1824c2dcabf"}}}
2022-07-18T09:05:33.343+0530 INFO [beat] instance/beat.go:1041 Process info {"system_info": {"process": {"cwd": "C:\Windows\system32", "exe": "C:\Program Files\Elastic\Beats\7.12.1\filebeat\filebeat.exe", "name": "filebeat.exe", "pid": 20412, "ppid": 724, "start_time": "2022-07-18T09:05:33.257+0530"}}}
2022-07-18T09:05:33.343+0530 INFO instance/beat.go:304 Setup Beat: filebeat; Version: 7.12.1
2022-07-18T09:05:33.343+0530 DEBUG [beat] instance/beat.go:330 Initializing output plugins
2022-07-18T09:05:33.343+0530 INFO eslegclient/connection.go:99 elasticsearch url: https://elkprod132-.ap-south-1.es.amazonaws.com:443
2022-07-18T09:05:33.343+0530 DEBUG [publisher] pipeline/consumer.go:148 start pipeline event consumer

I am using the filebeat input section below; I have tried different values and I keep getting the panic error again and again.

filebeat.inputs:
- type: filestream
  harvester_limit: 512
  enabled: true
  close_inactive: 1m
  clean_inactive: 4h
  clean_removed: true
  paths:
    - D:\Logs\XXX_CONTROLLER_XXX_METRICS***.json
  scan_frequency: 120
  json.keys_under_root: true
  json.add_error_key: true
  processors:
    - decode_json_fields:
        fields:
          - message
          - correlationId
          - Exception
        process_array: false
        max_depth: 10
        target: ''
        overwrite_keys: true

=== Error ===
panic: reflect.Select: too many cases (max 65536)

goroutine 130 [running]:
reflect.Select(0xc36750a000, 0x10001, 0x1036d, 0x1, 0x1e373c0, 0xc02d19b0e0, 0x16, 0x9801)
/usr/local/go/src/reflect/value.go:2181 +0x9a9
github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*Pipeline).runSignalPropagation(0xc000736140)
/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/pipeline.go:365 +0x1d4
created by github.com/elastic/beats/v7/libbeat/publisher/pipeline.(*Pipeline).registerSignalPropagation.func1
/go/src/github.com/elastic/beats/libbeat/publisher/pipeline/pipeline.go:350 +0x85

@jsoriano

Reopening, as there still seem to be issues in Filebeat 8.3: https://discuss.elastic.co/t/fillebeat-keeps-restarting-in-kuberneates/309702

@jsoriano reopened this on Aug 22, 2022
@jsoriano added the Team:Elastic-Agent-Data-Plane label and removed the Team:Integrations and Team:Services (Deprecated) labels on Aug 22, 2022
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@jstangroome

I experienced a very similar issue with filebeat v8.8.2 making little to no progress reading and shipping log files. Using lsof -o showed that the file descriptor offsets for the ~100 open files were not updating for at least 20 of the files, despite those files being actively written (i.e. not closed due to inactivity). The worst case was a ~250 MB web server container HTTP access log file of which filebeat had read only ~4 MB and hadn't made any progress in about an hour.

Per @kvch's comment I added filebeat.registry.flush: 60s to the filebeat.yml and this appears to have resolved the issue for my systems.

For reference, the filebeat.yml in use is very lean:

filebeat.autodiscover:
  providers:
    - type: kubernetes
      node: ${NODE_NAME}
      add_resource_metadata:
        namespace:
          enabled: false
        deployment: false
        cronjob: false
      templates:
        - condition:
            and:
              - regexp:
                  kubernetes.namespace: "^some-prefix.*"
              - not:
                  equals:
                    kubernetes.labels.some-label: "some-value"
          config:
            - type: container
              paths:
                - /var/log/containers/*${data.kubernetes.container.id}.log
              tail_files: true
#filebeat.registry.flush: 60s # adding this fixed throughput
output.elasticsearch:
  index: default
  hosts: ["http://localhost:9200"]
  allow_older_versions: true
setup:
  ilm.enabled: false
  template.enabled: false

@clairmont32

#16076 (comment)

I'm also experiencing this in 8.13.1 with only ~17k files created hourly.
