Bug Description
On a CIS-hardened (level 2) Charmed OpenStack control node hosting 25 LXDs running OpenStack control-plane services, installing and running grafana-agent inside those LXDs caused massive amounts of logs to be written to audit.log at the host level (12G in less than a day, at which point the host ran out of disk space).
Pretty much all of these "new" entries in audit.log are reports of grafana-agent accessing /var/log/.../*log files.
Logs from /var/log/aodh/aodh-evaluator.log (and from all the other files reported in audit.log) are searchable in Loki, and everything else looks fine. No related errors are reported in the logs of the grafana-agent running inside the LXD.
Additionally, not all files accessed by grafana-agent in the LXDs are reported in audit.log at the host level. The main difference seems to be the ownership of the log files and directories. For example, I see many records for /var/log/aodh/*.log files, /var/log/barbican/*.log files, etc., but nothing about /var/log/juju/*.log or /var/log/syslog.
Their ownership is as follows:
# no entries in audit.log for these
227484779 drwxr-xr-x 2 syslog adm 4.0K Jan 29 00:08 juju
227484432 -rw-r----- 1 syslog adm 2.6M Jan 31 12:43 syslog
# these are being reported constantly
227484930 drwxr-x--- 2 barbican adm 12K Jan 31 00:00 barbican
227484609 -rw-rw-r-- 1 root utmp 286K Jan 31 12:34 lastlog
227484694 -rw-rw---- 1 hacluster haclient 0 Dec 25 2022 pacemaker.log
227484449 -rw-rw-r-- 1 root utmp 183K Jan 31 12:34 wtmp
It seems that as long as files are owned by syslog:adm, grafana-agent's syscalls are not recorded. Accessing files owned by root, barbican (an OpenStack service user), or hacluster results in massive amounts of audit logs.
This may or may not be related to group membership of these user accounts:
root@juju-5f7845-5-lxd-1:/var/log# groups barbican
barbican : barbican
root@juju-5f7845-5-lxd-1:/var/log# groups hacluster
hacluster : haclient
root@juju-5f7845-5-lxd-1:/var/log# groups root
root : root sudo lxd
root@juju-5f7845-5-lxd-1:/var/log# groups syslog
syslog : syslog adm tty
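To pin down which audit rule is actually matching (and whether it keys on uid/gid rather than file ownership), the loaded rule set and the record keys can be inspected on the host. A minimal sketch, assuming a typical CIS rule set where failed file accesses carry a key like "access":
# On the LXD host: list loaded audit rules that match failed file accesses
sudo auditctl -l | grep -E 'EACCES|EPERM'
# Show a few interpreted grafana-agent records, including their rule key
sudo ausearch -ts today -i | grep -m 10 grafana-agent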
This massive audit.log spam can have catastrophic results. For example, if the CIS rule "4.1.2.3 Ensure system is disabled when audit logs are full" is in place, the system may, in the worst case, shut itself down after running out of space on the /var/log/audit partition.
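For reference, these are the /etc/audit/auditd.conf settings that rule mandates (a sketch based on the CIS Ubuntu 20.04 benchmark; verify against your own profile):
# /etc/audit/auditd.conf settings required by CIS 4.1.2.3
space_left_action = email
action_mail_acct = root
admin_space_left_action = halt   # with this set, filling the audit partition halts the machine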
The issue doesn't occur with filebeat, for example, so it might also be related to grafana-agent being packaged as a snap.
Is there anything that can be tweaked in the grafana-agent snap that could help with this?
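One thing that might be worth checking inside an affected LXD is whether all of the snap's interfaces are connected; a denied interface could, in theory, make the agent retry opens that then trip the host's audit rules (this is an assumption, not a confirmed mechanism):
# Inside an affected LXD: list the snap's interface connections
snap connections grafana-agent
# Look for AppArmor denials that could indicate retried opens
sudo journalctl -k | grep -m 10 'apparmor="DENIED".*grafana'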
Also, my recommendation is to avoid relating grafana-agent to Loki in any CIS-hardened deployments until this is resolved.
To Reproduce
Deploy grafana-agent in any OpenStack control-plane LXD container running on a CIS-hardened host, relate it to Loki, and watch /var/log/audit/audit.log.
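A quick way to watch the growth on the host (a sketch; paths assume the default auditd layout):
# On the LXD host, after the relation is established:
sudo watch -n 30 ls -lh /var/log/audit/
# Count grafana-agent records from the last 10 minutes
sudo ausearch -ts recent -i | grep -c grafana-agent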
Environment
CIS-hardened Ubuntu 20.04
grafana-agent snap 0.35.4 (rev 16, latest/stable, publisher 0x12b)
Charmed OpenStack focal/ussuri
Relevant log output
Snippets posted above
audit.log file sizes for the sake of completeness:
-r--r----- 1 root adm 9.3G Jan 30 00:00 audit.log-20240130_000001
-rw-r----- 1 root adm 0 Jan 30 00:00 audit.log.1
-rw-r----- 1 root adm 2.1G Jan 30 06:28 audit.log
Additional context
This is a potential blocker for grafana-agent deployments on CIS-hardened clouds.
Hi @przemeklal, I managed to relate grafana-agent to a CIS-hardened system (cis_level2_server) and I did not see this behavior. Could you check whether you can still reproduce the issue?
Deploying g-agent to a CIS-hardened machine is not enough to reproduce it. You also need to deploy LXD containers on that machine and relate g-agent to the apps running in those containers:
juju deploy ch:ubuntu inner --to lxd:0 --series focal # where 0 is your cis-hardened machine id
juju relate grafana-agent inner
Once the g-agents inside these LXDs try to access files under /var/log, the auditd log spam starts one level below, on the LXD "host".
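To confirm the records are landing on the outer host rather than inside the container, something like this can be run on machine 0 (a sketch; the comm field match is an assumption about how the records appear):
# On the CIS-hardened host (machine 0), not inside the LXD:
sudo ausearch -ts recent | grep -c 'comm="grafana-agent"'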
FWIW, the LXDs should also be hardened, but it would be interesting to see what happens if they're not.
I am still having issues reproducing this nested LXD setup. If someone could reach out and schedule some time to walk me through it, that would be very helpful.