
GA support for reading from journald #37086

Open · 8 tasks
cmacknz opened this issue Nov 10, 2023 · 12 comments
Labels: Team:Elastic-Agent (Label for the Agent team), Team:Elastic-Agent-Data-Plane (Label for the Agent Data Plane team)

cmacknz (Member) commented Nov 10, 2023

As of Debian 12, system logs are exclusively available via journald by default. Today we support reading journald logs via the Filebeat journald input, which is still in technical preview and has several major bugs filed against it. See https://github.com/elastic/beats/issues?q=is%3Aissue+is%3Aopen+journald, notably:

We need to provide a GA way to read journald logs. There are two paths to this:

  1. Fix the major issues in the journald input and GA it as is. All integrations that previously read syslog files by default will need a conditional to specify that journald should be used instead of one of the log files on Linux (see example). This conditional may need to be on the Linux distribution and not just Linux as a platform.
  2. Fold the existing journald functionality into filestream, so that there is only one way to read log files and all existing uses of filestream to read system logs continue to work with no or minimal modification. In the ideal case we detect that we are reading journald logs based on a .journal extension or well-known file paths, but we may need a configuration flag for this. If we do end up with a configuration flag, we could consider implementing journald support as a type of parser https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-filestream.html#_parsers (a hypothetical sketch of this follows below).
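
A rough, hypothetical sketch of what option 2 could look like if journald support were exposed as a filestream parser. The journald parser type and the .journal path handling below do not exist today; this is only an illustration of the idea, not a proposed configuration.

# Hypothetical sketch only: journald read through filestream (option 2).
# Neither the "journald" parser nor .journal path detection exists today.
- type: filestream
  id: system-journal
  paths:
    - /var/log/journal/*/*.journal
  parsers:
    - journald: ~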

Edit:
Option 1 is the path forward, we'll keep the separate journald input.

To close this issue we'll need to:

Tasks

(8 linked task issues; the task titles were not captured here, only their labels: Filebeat, Integration:journald, Team:Elastic-Agent, Team:Elastic-Agent-Data-Plane, bug, enhancement, docs)
cmacknz added the Team:Elastic-Agent label Nov 10, 2023
cmacknz (Member Author) commented Nov 10, 2023

@rdner I am interested in your opinion on this, given the amount of time you are spending trying to migrate and drive consistency between the log, filestream, and container inputs that already exist.

leehinman (Contributor) commented:

For option 1, do we have to provide a conditional? I think both inputs could be enabled at the same time; it would just have to be non-fatal for a source not to be present. For example, you can enable journald, logfile & udp in the iptables integration all at the same time (and UDP and journald are on by default).
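
For illustration only, a minimal standalone sketch of what "enable both and tolerate a missing source" could look like. The ids and paths are assumptions, and the duplication risk is discussed in the next comment:

# Sketch: both inputs enabled at once, as the iptables integration does with
# journald, logfile and udp. Hypothetical ids and paths.
- type: journald
  id: system-journald
- type: filestream
  id: system-syslog
  paths:
    - /var/log/syslog
    - /var/log/messages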

cmacknz (Member Author) commented Nov 14, 2023

If we don't have a conditional we risk duplicated logs. I think if we defaulted to always using both inputs we'd get a small amount of duplicated logs today on Debian 11; it looks like the kernel boot logs go to both journald and /var/log/:

craig_mackenzie@cmackenzie-debian11-test:~$ journalctl
-- Journal begins at Tue 2023-11-14 20:15:17 UTC, ends at Tue 2023-11-14 20:19:35 UTC. --
Nov 14 20:15:17 debian kernel: Linux version 5.10.0-26-cloud-amd64 (debian-kernel@lists.debian.org) (gcc-10 (D>
Nov 14 20:15:17 debian kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.10.0-26-cloud-amd64 root=UUID=62c0943b>
Nov 14 20:15:17 debian kernel: BIOS-provided physical RAM map:
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x0000000000001000-0x0000000000054fff] usable
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x0000000000055000-0x000000000005ffff] reserved
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x0000000000060000-0x0000000000097fff] usable
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x0000000000098000-0x000000000009ffff] reserved
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bf8ecfff] usable
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x00000000bf8ed000-0x00000000bf9ecfff] reserved
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x00000000bf9ed000-0x00000000bfaecfff] type 20
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x00000000bfaed000-0x00000000bfb6cfff] reserved
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x00000000bfb6d000-0x00000000bfb7efff] ACPI data
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x00000000bfb7f000-0x00000000bfbfefff] ACPI NVS
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x00000000bfbff000-0x00000000bffdffff] usable
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
Nov 14 20:15:17 debian kernel: BIOS-e820: [mem 0x0000000100000000-0x000000013fffffff] usable
Nov 14 20:15:17 debian kernel: printk: bootconsole [earlyser0] enabled
Nov 14 20:15:17 debian kernel: NX (Execute Disable) protection: active
Nov 14 20:15:17 debian kernel: efi: EFI v2.70 by EDK II
Nov 14 20:15:17 debian kernel: efi: TPMFinalLog=0xbfbf7000 ACPI=0xbfb7e000 ACPI 2.0=0xbfb7e014 SMBIOS=0xbf9ca0>
Nov 14 20:15:17 debian kernel: secureboot: Secure boot disabled
Nov 14 20:15:17 debian kernel: SMBIOS 2.4 present.
Nov 14 20:15:17 debian kernel: DMI: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Nov 14 20:15:17 debian kernel: Hypervisor detected: KVM
Nov 14 20:15:17 debian kernel: kvm-clock: Using msrs 4b564d01 and 4b564d00
Nov 14 20:15:17 debian kernel: kvm-clock: cpu 0, msr 78801001, primary cpu clock
Nov 14 20:15:17 debian kernel: kvm-clock: using sched offset of 7655756989 cycles
Nov 14 20:15:17 debian kernel: clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max>
Nov 14 20:15:17 debian kernel: tsc: Detected 2200.158 MHz processor
Nov 14 20:15:17 debian kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
Nov 14 20:15:17 debian kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
Nov 14 20:15:17 debian kernel: last_pfn = 0x140000 max_arch_pfn = 0x400000000
Nov 14 20:15:17 debian kernel: MTRR default type: write-back
Nov 14 20:15:17 debian kernel: MTRR fixed ranges enabled:
Nov 14 20:15:17 debian kernel:   00000-9FFFF write-back
Nov 14 20:15:17 debian kernel:   A0000-FFFFF uncachable

craig_mackenzie@cmackenzie-debian11-test:~$ grep -rn 'kvm-clock: cpu 0, msr 78801001, primary cpu cloc' /var/log
grep: /var/log/journal/3465bc73197d954b92a16251605729f5/system.journal: binary file matches
grep: /var/log/private: Permission denied
grep: /var/log/btmp: Permission denied
/var/log/syslog:125:Nov 14 20:15:18 debian kernel: [    0.000000] kvm-clock: cpu 0, msr 78801001, primary cpu clock
/var/log/messages:27:Nov 14 20:15:18 debian kernel: [    0.000000] kvm-clock: cpu 0, msr 78801001, primary cpu clock
grep: /var/log/chrony: Permission denied
/var/log/kern.log:27:Nov 14 20:15:18 debian kernel: [    0.000000] kvm-clock: cpu 0, msr 78801001, primary cpu clock

Granted, if someone set their logs path to /var/log/*.log they'd pick up these logs from both syslog.log and kern.log today anyway.

cmacknz (Member Author) commented Nov 14, 2023

It also looks like the journald input is using go-systemd/sdjournal, which just wraps the systemd journal C API:

https://github.com/coreos/go-systemd/blob/7d375ecc2b092916968b5601f74cca28a8de45dd/sdjournal/journal.go#L424-L434

func NewJournal() (j *Journal, err error) {
	j = &Journal{}

	sd_journal_open, err := getFunction("sd_journal_open")
	if err != nil {
		return nil, err
	}

	r := C.my_sd_journal_open(sd_journal_open, &j.cjournal, C.SD_JOURNAL_LOCAL_ONLY)

This wouldn't fit with the idea of just using a filestream parser for journald. At best we could hide the entire journald input inside filestream so there's a single log input, but we'd probably still need dedicated configuration specific to reading journald files.

cmacknz (Member Author) commented Nov 14, 2023

The default journald configuration that reads everything is only two lines, so at this point I'm convinced that keeping the journald input and improving it is the best path:

# Read all journald logs
- type: journald
  id: everything

I don't think folding this into filestream would make filestream easier to use or easier to maintain.

rdner (Member) commented Nov 15, 2023

To summarise what we discussed with @cmacknz on a call:

  1. I think the agent should detect whether the OS has journald and expose a new variable in the host object for the integration templates to use, like the condition here: https://github.com/elastic/integrations/blob/f1b08ddd00724eaf3b8d9eb9ef2221f8fc7eefc4/packages/system/data_stream/system/agent/stream/winlog.yml.hbs#L2C1-L2C41 (see the hypothetical template sketch after this list).

  2. I think users of the agent are not interested in deep configuration, so the integrations should handle logs both in files and in journald seamlessly for the user.

  3. Users who run standalone Filebeat are used to manual configurations and should be able to take care of configuring the right input for the right OS/distribution – journald or filestream.

  4. journald should remain a separate input; it's not really compatible with the filestream architecture and consumes logs via special syscalls.
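
For point 1, a hypothetical integration template sketch modelled on the linked winlog.yml.hbs condition. The host.supports_journald variable does not exist; it stands in for whatever journald-detection variable the agent would expose:

{{#if host.supports_journald}}
- type: journald
  id: system-journald
{{else}}
- type: filestream
  id: system-syslog
  paths:
    - /var/log/syslog
    - /var/log/messages
{{/if}}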

andrewkroh (Member) commented Jan 10, 2024

A few things that come to mind related to journald:

  • The input produces large events with lots of metadata. This could have an impact on storage usage. It also might make sense to drop some of the fields.
  • The input is not optimized for producing ECS fields. IIRC it does populate ECS fields but it also duplicates the same data into non-ECS fields. It would be much better to optimize the events coming out of the input before turning this on for users by default. All journald input users would benefit IMO.
  • To make reading data most efficient for each data stream (system.syslog and system.auth), ideally we would use journalctl filtering (e.g. system.auth might use _TRANSPORT=syslog). So we need to figure out whether the associated data is available in journald and what the appropriate filters are to select it. Implementing filtering in the Beat using processors would be less than ideal for efficiency. (Viewing the data with sudo journalctl -o export is a great way to determine what filtering might work; a sketch follows after this list.)
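
As a sketch of the filtering idea, assuming the journald input's include_matches option (the exact option syntax and the right matches per data stream would still need to be verified against journalctl -o export output):

# Hypothetical filtering for a system.auth-style data stream: filter in
# journald itself rather than with Beat processors.
- type: journald
  id: system-auth
  include_matches:
    - _TRANSPORT=syslog
    - SYSLOG_FACILITY=10   # authpriv; an assumed match, not verified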

cmacknz (Member Author) commented Jan 11, 2024

Thanks, I think it would make sense to compare the events collected without the journald input to those collected with it for the sources needed by the system integration. If the event content is significantly different, it will cause problems for dashboards and queries.

andrewkroh (Member) commented:

This is alluded to in some linked issues, but I wanted to explicitly mention that the journald library version in our container images is v245 (from Mar 6, 2020), and when deploying this image on Ubuntu 22.04 nodes, which use v249, you can't collect logs from the host (no crashes, just no logs). My workaround has been to repack the filebeat binaries into a more recent base image. We might want to consider bumping our base image as part of making this GA.

belimawr (Contributor) commented May 1, 2024

I found another bug, probably another blocker: #39352

It seems that if Filebeat falls too far behind the journal, the input will crash shortly after starting.

belimawr (Contributor) commented May 1, 2024

#32782 and #39352 happen intermittently in my test environments. So far I have not managed to isolate them, but both come from a call to github.com/coreos/go-systemd/v22/sdjournal:

entry, err := r.journal.GetEntry()

I have only managed to reproduce #39352 with journald from systemd 252 (252.16-1.amzn2023.0.2).

pierrehilbert added the Team:Elastic-Agent-Data-Plane label May 5, 2024
elasticmachine (Collaborator) commented:

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
