Caddy is generating gigabytes of logs with: [INFO][FileStorage:*]Lock for * is stale #4448
Comments
Thanks for opening an issue! We'll look into this.
It's not immediately clear to me what is going on, so I'll need your help to understand it better.
Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.
I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.
This template will ask for some information you've already provided; that's OK, just fill it out the best you can. 👍 I've also included some helpful tips below the template. Feel free to let me know if you have any questions!
Thank you again for your report, we look forward to resolving it!
Template
Instructions -- please heed otherwise we cannot help you (help us help you!)
Example of a tutorial: Create a config file: |
Well, this is an old version of Caddy, and the nature of the issue makes it very difficult to reproduce. I mostly filed a bug in case someone else stumbled upon it, so it can gather information in the future, or if I experience it again. I'm closing it for now until we can find a reproducer. |
I've seen this error before too. |
No, it's running a classical mounted filesystem on a physical machine. The same setup on a VPS does not have the issue. edit: a VM (virt-manager) on a physical machine |
By the way, I still have the issue with the latest Caddy. edit: I configured log rotation better, so I get access to logs on a functioning host... I'll attach debug info next |
The problem seems tied to a certificate maintenance task that cannot succeed... The sequence of events:
|
That's interesting and good to know.
Wait. Are you still encountering hundreds of log entries per second right now? rm /var/lib/caddy/locks/* And posting the output of |
Some locks seem to contain JSON, some seem empty, some seem to contain zeros:
|
It could be that, because the logs filled up the disk, Caddy could not save the lock files properly, resulting in corrupted files |
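To check this hypothesis on an affected host, the lock files can be sorted into the three kinds described above (JSON, empty, zero-filled). The following is a minimal sketch, not part of Caddy or certmagic: the `classify` helper is hypothetical, and the default directory is simply the `/var/lib/caddy/locks` path mentioned earlier in this thread (pass another directory as the first argument).

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// classify is a hypothetical helper: it reports whether a lock file's
// contents are empty, made only of zero bytes, valid JSON, or anything else.
func classify(data []byte) string {
	switch {
	case len(data) == 0:
		return "empty"
	case len(bytes.Trim(data, "\x00")) == 0:
		return "zeros"
	case json.Valid(data):
		return "json"
	default:
		return "other"
	}
}

func main() {
	// Assumption: the locks directory used in this thread.
	dir := "/var/lib/caddy/locks"
	if len(os.Args) > 1 {
		dir = os.Args[1]
	}

	entries, err := os.ReadDir(dir)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot list locks:", err)
		return
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s: %v\n", e.Name(), err)
			continue
		}
		// One line per file: its kind, then its name.
		fmt.Printf("%-6s %s\n", classify(data), e.Name())
	}
}
```

Files reported as `empty` or `zeros` would be consistent with writes that were cut short when the disk filled up; this only inspects the files and does not modify them.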
Reopening because this discussion/investigation is interesting. Worth a second look. |
Let's look at the time between when the system starts up and when Caddy starts generating those log lines (so I could perhaps make a small reproducer)...
The server has rebooted since, and with the disk being full, it may not exhibit the same behaviour. I'll remove a log and restart. By the way, /var/log/syslog and /var/log/messages both contain the log lines above, but syslog is 11G and messages is only 6.5G |
The bug clearly does not happen every time. Since this morning Caddy has been quiet. Trying to run Caddy outside of its systemd service, I cannot reproduce the problem. However, if I give Caddy the autosaved JSON config, I get the error. The difference between those two files is exclusively in the domain list. Edit: the difference in the domain list, I suspect it is only relevant for
@@ -506,12 +578,18 @@
"certificates": {
"@id": "tls.certificates",
"automate": [
- "webmsg.me",
- "www.webmsg.me",
"bourderie-1.webmsg.me",
"www.bourderie-1.webmsg.me",
+ "dav.bourderie-1.webmsg.me",
"jmap.bourderie-1.webmsg.me",
- "dav.bourderie-1.webmsg.me"
+ "mildred.fr",
+ "www.mildred.fr",
+ "dav.mildred.fr",
+ "jmap.mildred.fr",
+ "test.webmsg.me",
+ "www.test.webmsg.me",
+ "dav.test.webmsg.me",
+ "jmap.test.webmsg.me"
]
}
}
Running Caddy for some time and collecting the domains that trigger the offending log line yields the same domain every time:
Looking at the lock, it's a legitimate JSON file:
|
Trying to come up with a minimal config file, removing:
I came up with this minimal configuration:
{
"apps": {
"http": {
"servers": {
"master": {
"listen": [
"127.0.0.1:443"
],
"routes": [
{
"handle": [
{
"body": "TEST",
"handler": "static_response",
"headers": {
"Content-Type": [
"text/plain; charset=utf-8"
]
}
}
],
"match": [
{
"host": [
"autoconfig.test.webmsg.me"
]
}
]
}
]
}
}
}
},
"logging": {
"logs": {
"default": {
"level": "INFO",
"writer": {
"output": "stdout"
}
}
}
},
"storage": {
"module": "file_system",
"root": "/var/lib/caddy"
}
}
When I run Caddy with:
I get too many of:
|
Got a minimal reproducer:
config:
{
"apps": {
"http": {
"servers": {
"master": {
"listen": [
"127.0.0.1:443"
],
"routes": [
{
"handle": [
{
"body": "TEST",
"handler": "static_response",
"headers": {
"Content-Type": [
"text/plain; charset=utf-8"
]
}
}
],
"match": [
{
"host": [
"autoconfig.test.webmsg.me"
]
}
]
}
]
}
}
}
},
"logging": {
"logs": {
"default": {
"level": "INFO",
"writer": {
"output": "stdout"
}
}
}
},
"storage": {
"module": "file_system",
"root": "/tmp/caddy-bug"
}
}
Start Caddy and trigger bug:
|
edit: there is a problem with my reproducer, disregard what follows. There is really something about this domain that causes the bug. It's an NXDOMAIN, but not all NXDOMAINs cause the bug.
With another NXDOMAIN:
|
Got an easy reproducer, see first comment. I believe I ran out of disk space at some point, which left the locks in a bad state. I have since increased the VM disk size, but the bad state was kept in the lock files. edit: The reason it did not happen every time is that the offending domain name is added later on through the admin API by an external script that needs to run. Or Caddy needs to restore a config with the offending domain. |
Okay so it seems to me like there's an infinite loop that needs fixing. I think it's in certmagic, not Caddy, actually. If you go searching for that error message in the code, you should be able to find it. (I'm away for the weekend, or I'd take a closer look myself right now) |
I didn't find the message in Caddy's code; it's probably in a dependency, yes |
Thanks for the added details, I will circle back around to this when I get a chance! (or someone else can beat me to it, of course) |
We aren't checking the error from So it's possible that the removal is failing and causing an infinite loop. However, the It looks like in the original implementation of this atomic unlocking, the point of it being atomic was to clean up empty dirs after removing a lock, but now the caller does that in a separate function of CleanStorage(). So I wonder if we don't even need atomic unlocking anymore. We might be able to simply do I do not know of a way to do an atomic remove that doesn't involve creating more state that can be interrupted, like was the case here. It's rare, yes; but a lingering Basically, I'm thinking it might be best to just change the "atomic unlock" to a regular Thoughts? |
I went ahead and committed a fix that I hope will work. @mildred can you give it a try? The mutual exclusion guarantees are weaker now in the presence of stale locks, but I suspect races on those will be rarer than the infinite looping you were experiencing (and less problematic too). |
Should fix caddyserver/caddy#4448 Weaker mutual exclusion guarantees, but probably the better alternative
I have a Caddy server that is stuck in a loop, generating thousands of messages, seemingly as fast as it can, filling the hard drive with gigabytes of logs until the disk is full.
The log line is:
All log lines are exactly the same, and I suspect hundreds of lines per second.
Running caddy
v2.3.0 h1:fnrqJLa3G5vfxcxmOH/+kJOcunPLhSBnjgIvjXV/QTA=
obtained with xcaddy build --with github.com/mholt/caddy-l4@master --output /usr/local/bin/caddy
edit: It seems to occur in more recent versions of Caddy too, and other log lines than this one are filling up the logs as well. In the configuration that causes this error, Caddy is, by misconfiguration, asked to obtain TLS certificates for domains that do not point to this machine, so obtaining the certificates fails forever.
1. Environment
1a. Operating system and version
Debian GNU/Linux 11 (bullseye), running as a KVM guest (libvirt) on a bare-metal Fedora host on consumer hardware
1b. Caddy version (run caddy version or paste commit SHA)
v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=
1c. Go version (if building Caddy from source; run go version)
downloaded from https://caddyserver.com/api/download?os=linux&arch=amd64&p=github.com%2Fmholt%2Fcaddy-l4
2. Description
2a. What happens (briefly explain what is wrong)
Caddy generates an estimated hundreds of log lines per second at INFO level, filling up the machine's disk. More than 10G in a night.
2c. Log output
Too much to paste in full, see: #4448 (comment)
2d. Workaround(s)
None yet. Tweak log rotation to at least be able to purge old logs more easily while keeping some of the logs.
Cleaning up the locks directory, and especially removing the *.lock.unlock files where there are matching JSON *.lock files.
2e. Relevant links
N/A
3. Tutorial (minimal steps to reproduce the bug)
Run this shell script (as root to be able to bind port 443):