Limit logging by cloudprober container #679

Closed · ls692 opened this issue Feb 7, 2024 · 7 comments
Labels: bug (Something isn't working)

Comments

@ls692
Collaborator

ls692 commented Feb 7, 2024

Describe the bug
We are seeing several cases of cloudprober fully using the local disk.

This is happening with versions from early November, and we are not sure of the root cause because we are unable to log into the broken VMs :) due to disk-full errors. I suspect that cloud-logging calls fail or get backed up, we start logging to disk, and once the disk is full we can no longer log in.

From https://docs.docker.com/config/containers/logging/configure/ it seems that we can configure a maximum size. Is there any risk in adding something like, say,

docker run -e "SYSVARS=${VARS}" --env-file <(env | grep CLOUDPROBER_)
--log-opt "max-size=100m" --log-opt "max-file=3" --log-driver "local"
--net host --privileged -v /tmp:/tmp "${IMAGE}:${VERSION}"

to third_party/cloudprober/tools/cloudprober_startup.sh to use the local logging driver, limited to 3 files of 100 MB each, as a quick fix, and then work on making this configurable later on?
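
For reference, the same Docker docs also describe setting these limits host-wide in /etc/docker/daemon.json; a sketch of that alternative is below (it only applies to newly created containers, after a daemon restart), in case we prefer it over changing the run command:

{
  "log-driver": "local",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}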

ls692 added the bug (Something isn't working) label on Feb 7, 2024
@manugarg
Contributor

manugarg commented Feb 8, 2024

Hi @ls692,

We changed the underlying logger in Aug 2023 (#462) to use Go's structured logger (log/slog) instead of the Google logger (https://github.com/golang/glog).

Interestingly, the new logger doesn't write to disk at all. It simply emits logs to stderr, and it does that regardless of whether the logs are also going to cloud logging or not.

Does the internal version have the change above? Do you see any logs being written to /tmp at all? There should not be any, but I am wondering if the system logger (journald) is siphoning off stderr logs from the service and storing them on disk. If that's what is happening, I found this for disabling those logs in systemd:
https://serverfault.com/questions/858843/how-to-disable-logging-for-a-particular-systemd-unit

There is also a way to configure journald instead of dropping everything:
https://unix.stackexchange.com/questions/696781/turn-off-logging-to-journald-for-specified-systemd-service
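
For illustration, a minimal sketch of a per-unit drop-in covering both approaches; the unit name cloudprober.service is hypothetical and should match whatever unit actually starts the container:

# /etc/systemd/system/cloudprober.service.d/10-quiet.conf
# (run `systemctl daemon-reload` and restart the unit afterwards)
[Service]
# Drop everything the unit writes to stdout/stderr...
StandardOutput=null
StandardError=null
# ...or, less drastically, keep only warnings and above in the journal:
# LogLevelMax=warning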

@ls692
Collaborator Author

ls692 commented Feb 9, 2024

It does look like stderr is being pulled in by journald, at least on the VMs that are functioning fine. It's not going to /tmp but to /var/log/journald, and it is being reported as the output of "cloudprober_updater.sh". I will try disabling logging for this module to get back to the old behavior.

@ls692
Collaborator Author

ls692 commented Feb 14, 2024

After some debugging, I am now fairly confident that it is the cloudprober logger that is causing the disk-usage issues. journald has protection via SystemMaxUse (1 GB) and SystemKeepFree (default of 15% of disk size), so journald should not have used all of the disk.

I do see disk usage when inspecting a sample container (details at the bottom). The Docker documentation (link in the description) also indicates that the default json-file logger can use up all disk space, but again I could not verify this on an existing "bad" VM.

The only reason we seem to be running into this is that some of the probes are very noisy (e.g., during initialization, when things are broken); in most other cases the default logger generates a pretty reasonable volume of logs.

However, given the high cost of fixing (recreating the VM), limiting logging might be a good tradeoff.

$ # From a VM running for ~3 days.
$ docker inspect $container_id
...
        "LogPath": "/var/lib/docker/containers/cddd93a95aaac56546d72e8e10563a87bed3f2088c11644c1b6999eb1a977494/cddd93a95aaac56546d72e8e10563a87bed3f2088c11644c1b6999eb1a977494-json.log",
...
        "HostConfig": {
            "Binds": [
                "/tmp:/tmp"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {
                    "tag": "{{.Name}}"
                }
            },
...

$ du -chs /var/lib/docker/containers/cddd93a95aaac56546d72e8e10563a87bed3f2088c11644c1b6999eb1a977494/cddd93a95aaac56546d72e8e10563a87bed3f2088c11644c1b6999eb1a977494-json.log
19M	/var/lib/docker/containers/cddd93a95aaac56546d72e8e10563a87bed3f2088c11644c1b6999eb1a977494/cddd93a95aaac56546d72e8e10563a87bed3f2088c11644c1b6999eb1a977494-json.log

$ ps -ef | grep docker
docker run -e SYSVARS=kernel=6.1.58+,google_release=17800.66.78,cloudprober_tag=20240205_RC00,cloudprober_version=sha256:9939d --env-file /dev/fd/63 --net host --privileged -v /tmp:/tmp gcr.io/cloudradar-hb-ipv6/cloudradar_hb_ipv6_cloudprober:20240205_RC00
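
For completeness, the journald limits mentioned above can be double-checked on a VM with standard systemd tooling (nothing cloudprober-specific):

$ journalctl --disk-usage
$ # The limits live in /etc/systemd/journald.conf (or a drop-in under
$ # /etc/systemd/journald.conf.d/); unset values fall back to the defaults.
$ grep -E 'SystemMaxUse|SystemKeepFree' /etc/systemd/journald.conf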

@manugarg
Contributor

manugarg commented Feb 15, 2024

Thanks, Lenin, for looking further! I had no idea that Docker writes container logs to a local file with no default size limit (this is problematic in general).

Looking at https://docs.docker.com/config/containers/logging/configure/, we have the following options:

  1. Restore the old behavior, i.e., write to cloud logging only if running on GCP.
  2. Use one of the following flags with docker: --log-opt max-size=500m (limit the docker log) or --log-driver=journald (let journald handle it). I think letting journald handle it is probably better (a sketch of that is below).

We can restore the old behavior, but it had a problem: if logging fails for some reason (e.g., the logging API is not enabled, or we run out of quota), the user may not find out what's going on -- or maybe the logging package outputs something when that happens; I'll try to find out. Also, it might make sense to change the startup script to set --log-driver=journald anyway.
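
A sketch of what option 2 with journald could look like in the startup script; the image/env handling simply mirrors the existing invocation, and the tag option is optional:

docker run -e "SYSVARS=${VARS}" --env-file <(env | grep CLOUDPROBER_) \
  --log-driver journald --log-opt tag="{{.Name}}" \
  --net host --privileged -v /tmp:/tmp "${IMAGE}:${VERSION}"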

@manugarg
Contributor

As per Google's cloud logging package:

type Client struct {

	// OnError is called when an error occurs in a call to Log or Flush. The
	// error may be due to an invalid Entry, an overflow because BufferLimit
	// was reached (in which case the error will be ErrOverflow) or an error
	// communicating with the logging service. OnError is called with errors
	// from all Loggers. It is never called concurrently. OnError is expected
	// to return quickly; if errors occur while OnError is running, some may
	// not be reported. The default behavior is to call log.Printf.
	//
	// This field should be set only once, before any method of Client is called.
	OnError func(err error)
	// contains filtered or unexported fields
}

So it seems it will print to stderr if there is a problem flushing logs.
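
For illustration, a minimal sketch (not cloudprober code; the project ID and log name are hypothetical) of setting OnError explicitly so flush failures are clearly surfaced on stderr:

package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"cloud.google.com/go/logging"
)

func main() {
	ctx := context.Background()
	client, err := logging.NewClient(ctx, "my-project") // hypothetical project ID
	if err != nil {
		log.Fatalf("logging.NewClient: %v", err)
	}

	// OnError should be set before any other Client method is called; the
	// default behavior is log.Printf, which also ends up on stderr.
	client.OnError = func(e error) {
		fmt.Fprintf(os.Stderr, "cloud logging error: %v\n", e)
	}
	defer client.Close() // Close flushes any buffered entries.

	lg := client.Logger("cloudprober-sketch") // hypothetical log name
	lg.Log(logging.Entry{Payload: "hello"})   // buffered; any failure reaches OnError
}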

@ls692
Collaborator Author

ls692 commented Feb 16, 2024

I prefer journald a bit more because it allows for better configuration of logging. Let me propose a CL for that.

ls692 self-assigned this on Feb 16, 2024
@manugarg
Contributor

manugarg commented Mar 6, 2024

Lenin, assuming this is resolved.

manugarg closed this as completed on Mar 6, 2024