
Caddy is generating gigabytes of logs with: [INFO][FileStorage:*]Lock for * is stale #4448

Closed
mildred opened this issue Nov 26, 2021 · 22 comments
Labels
bug 🐞 Something isn't working upstream ⬆️ Relates to some dependency of this project

Comments

@mildred

mildred commented Nov 26, 2021

I have a Caddy server that is stuck in a loop, generating thousands of messages, seemingly as fast as it can, filling the hard drive with gigabytes of logs until the disk is full.

The log line is:

Nov 26 22:03:13 mail caddy[1072]: 2021/11/26 22:02:07 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock

All log lines are exactly the same, and I suspect hundreds of lines per second.

Running caddy v2.3.0 h1:fnrqJLa3G5vfxcxmOH/+kJOcunPLhSBnjgIvjXV/QTA= obtained with xcaddy build --with github.com/mholt/caddy-l4@master --output /usr/local/bin/caddy


edit: It seems to occur in more recent versions of Caddy too, and other log lines besides this one are filling up the logs. In the configuration that causes this error, Caddy is misconfigured to obtain TLS certificates for domains that do not point to this machine, resulting in a perpetual failure to obtain the certificates.

1. Environment

1a. Operating system and version

Debian GNU/Linux 11 (bullseye), running a KVM guest (libvirt) on a bare-metal Fedora host hosted on consumer hardware

1b. Caddy version (run caddy version or paste commit SHA)

v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

1c. Go version (if building Caddy from source; run go version)

downloaded from https://caddyserver.com/api/download?os=linux&arch=amd64&p=github.com%2Fmholt%2Fcaddy-l4

2. Description

2a. What happens (briefly explain what is wrong)

Caddy generated hundreds of log lines per second at INFO level, filling up the machine's disk: more than 10G in one night.

2c. Log output

Too much to paste in full, see : #4448 (comment)

2d. Workaround(s)

None yet; tweaking log rotation at least makes it possible to purge old logs more easily while keeping some of them.

Cleaning up the locks directory, especially removing the *.lock.unlock files that have matching JSON *.lock files.
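The cleanup described above could be sketched like this (a hypothetical helper, not an official tool; it assumes Caddy's default storage path of /var/lib/caddy):

```shell
# Remove each *.lock.unlock file that has a matching *.lock file,
# which is the situation described in this workaround.
remove_orphan_unlocks() {
	for unlock in "$1"/*.lock.unlock; do
		[ -e "$unlock" ] || continue       # glob matched nothing
		lock="${unlock%.unlock}"
		if [ -e "$lock" ]; then
			rm -v "$unlock"                # only when the .lock still exists
		fi
	done
}

remove_orphan_unlocks /var/lib/caddy/locks
```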

2e. Relevant links

N/A

3. Tutorial (minimal steps to reproduce the bug)

Run this shell script (as root to be able to bind port 443):

#!/bin/sh

set -ex

CADDY=$PWD/caddy
CONFIG=$PWD/caddy-bug.json
CADDYHOME=$PWD/caddy-bug
DOMAIN=nxdomain.test.webmsg.me

mkdir -p $CADDYHOME/locks
touch $CADDYHOME/locks/issue_cert_$DOMAIN.lock.unlock
echo '{"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}' >$CADDYHOME/locks/issue_cert_$DOMAIN.lock

if ! [ -e $CADDY ]; then
	curl "https://caddyserver.com/api/download?os=linux&arch=amd64" -o $CADDY
	chmod +x $CADDY
fi

cat <<CONFIG >$CONFIG
{
  "apps": {
    "http": {
      "servers": {
        "master": {
          "listen": [
            "127.0.0.1:443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "body": "TEST",
                  "handler": "static_response",
                  "headers": {
                    "Content-Type": [
                      "text/plain; charset=utf-8"
                    ]
                  }
                }
              ],
              "match": [
                {
                  "host": [
                    "$DOMAIN"
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  },
  "logging": {
    "logs": {
      "default": {
        "level": "INFO",
        "writer": {
          "output": "stdout"
        }
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "$CADDYHOME"
  }
}
CONFIG

$CADDY run --watch --config $CONFIG
@mholt
Member

mholt commented Nov 26, 2021

Thanks for opening an issue! We'll look into this.

It's not immediately clear to me what is going on, so I'll need your help to understand it better.

Ideally, we need to be able to reproduce the bug in the most minimal way possible. This allows us to write regression tests to verify the fix is working. If we can't reproduce it, then you'll have to test our changes for us until it's fixed -- and then we can't add test cases, either.

I've attached a template below that will help make this easier and faster! This will require some effort on your part -- please understand that we will be dedicating time to fix the bug you are reporting if you can just help us understand it and reproduce it easily.

This template will ask for some information you've already provided; that's OK, just fill it out the best you can. 👍 I've also included some helpful tips below the template. Feel free to let me know if you have any questions!

Thank you again for your report, we look forward to resolving it!

Template

## 1. Environment

### 1a. Operating system and version

```
paste here
```


### 1b. Caddy version (run `caddy version` or paste commit SHA)

```
paste here
```


### 1c. Go version (if building Caddy from source; run `go version`)

```
paste here
```


## 2. Description

### 2a. What happens (briefly explain what is wrong)




### 2b. Why it's a bug (if it's not obvious)




### 2c. Log output

```
paste terminal output or logs here
```



### 2d. Workaround(s)




### 2e. Relevant links




## 3. Tutorial (minimal steps to reproduce the bug)




Instructions -- please heed otherwise we cannot help you (help us help you!)

  1. Environment: Please fill out your OS and Caddy versions, even if you don't think they are relevant. (They are always relevant.) If you built Caddy from source, provide the commit SHA and specify your exact Go version.

  2. Description: Describe at a high level what the bug is. What happens? Why is it a bug? Not all bugs are obvious, so convince readers that it's actually a bug.

    • 2c) Log output: Paste terminal output and/or complete logs in a code block. DO NOT REDACT INFORMATION except for credentials.
    • 2d) Workaround: What are you doing to work around the problem in the meantime? This can help others who encounter the same problem, until we implement a fix.
    • 2e) Relevant links: Please link to any related issues, pull requests, docs, and/or discussion. This can add crucial context to your report.
  3. Tutorial: What are the minimum required specific steps someone needs to take in order to experience the same bug? Your goal here is to make sure that anyone else can have the same experience with the bug as you do. You are writing a tutorial, so make sure to carry it out yourself before posting it. Please:

    • Start with an empty config. Add only the lines/parameters that are absolutely required to reproduce the bug.
    • Do not run Caddy inside containers.
    • Run Caddy manually in your terminal; do not use systemd or other init systems.
    • If making HTTP requests, avoid web browsers. Use a simpler HTTP client instead, like curl.
    • Do not redact any information from your config (except credentials). Domain names are public knowledge and often necessary for quick resolution of an issue!
    • Note that ignoring this advice may result in delays, or even in your issue being closed. 😞 Only actionable issues are kept open, and if there is not enough information or clarity to reproduce the bug, then the report is not actionable.

Example of a tutorial:

Create a config file:
{ ... }

Open terminal and run Caddy:

$ caddy ...

Make an HTTP request:

$ curl ...

Notice that the result is ___ but it should be ___.

@mholt mholt added the needs info 📭 Requires more information label Nov 26, 2021
@mildred
Author

mildred commented Nov 26, 2021

Well, this is an old version of Caddy, and the nature of the issue makes it very difficult to reproduce. I mostly filed this bug in case someone else stumbles upon it, so it can gather information in the future, or if I experience it again. I'm closing it for now until we can find a reproducer.

@mildred mildred closed this as completed Nov 26, 2021
@emilylange
Member

I've seen this error before too.
Did you (@mildred) run caddy in a cluster (multiple parallel instances using the same storage) or perhaps with a network filesystem?

@mildred
Author

mildred commented Dec 3, 2021

No, it's running on a classic locally mounted filesystem on a physical machine. The same setup on a VPS does not have the issue.

edit: a VM (virt-manager) on a physical machine

@mildred
Author

mildred commented Dec 3, 2021

By the way, I still have the issue with latest caddy v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

edit: I configured log rotation better, so I have access to logs on a functioning host... I'll attach debug info next

@mildred
Author

mildred commented Dec 3, 2021

The problem seems tied to a certificate maintenance task that cannot succeed...

The sequence of events:

  • a system makes an API request to update the config apps.http.tls.certificates.automate and adds some domains to the list (generated from a database). The list includes a test domain for which this host cannot obtain a certificate
  • TLS certificate maintenance starts immediately: Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.8930051,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc00031bdc0"}
  • caddy tries to acquire lock for various domains on the list and fails (lock file seems corrupt)
    • Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.908584,"logger":"tls.obtain","msg":"acquiring lock","identifier":"mildred.fr"}
    • Nov 28 05:28:32 mail caddy[1072]: {"level":"error","ts":1638073712.9096868,"logger":"tls","msg":"job failed","error":"mildred.fr: obtaining certificate: unable to acquire lock 'issue_cert_mildred.fr': decoding lockfile contents: invalid character '\\x00' looking for beginning of value"}
    • the same for the other domains, I also have the following errors...
    • Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.915032,"logger":"tls.obtain","msg":"acquiring lock","identifier":"dav.mildred.fr"}
    • Nov 28 05:28:32 mail caddy[1072]: {"level":"error","ts":1638073712.915886,"logger":"tls","msg":"job failed","error":"dav.mildred.fr: obtaining certificate: unable to acquire lock 'issue_cert_dav.mildred.fr': decoding lockfile contents: EOF"}
  • caddy stops certificate maintenance: Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.945054,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc000312000"}
  • caddy saves config: Nov 28 05:28:32 mail caddy[1072]: {"level":"info","ts":1638073712.9469566,"msg":"autosaved config (load with --resume flag)","file":"/var/lib/caddy/.config/caddy/autosave.json"}
  • some more config updates are performed on other parts of the configuration (updating some routing rules using a generated host list)
  • certificate maintenance runs again, with the same errors as before, plus an additional error for some domains: Nov 28 05:28:33 mail caddy[1072]: 2021/11/28 05:28:33 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_dav.test.webmsg.me' is stale (created: 2021-11-27 15:32:02.49914457 +0100 CET, last update: 2021-11-27 15:36:32.026606202 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_dav.test.webmsg.me.lock
  • caddy saves config
  • ...
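The "is stale" log message above implies a staleness test: the lock's JSON "updated" timestamp is compared against a freshness window. This is a hedged sketch of that check in shell; the 2-minute window is an assumption, not certmagic's actual value, and GNU date is assumed for timestamp parsing:

```shell
# Returns success if the lock file's "updated" timestamp is older than
# 120 seconds (an illustrative window, not the real one).
lock_is_stale() {
	updated=$(sed -n 's/.*"updated":"\([^"]*\)".*/\1/p' "$1")
	age=$(( $(date +%s) - $(date -d "$updated" +%s) ))
	[ "$age" -gt 120 ]
}

lock=/var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
if [ -e "$lock" ] && lock_is_stale "$lock"; then
	echo "stale: would remove and retry"
fi
```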

@emilylange
Member

No, it's running a classical mounted filesystem on a physical machine. The same setup on a VPS does not have the issue

That's interesting and good to know.
I've only ever seen this when running on top of a network filesystem.

By the way, I still have the issue with latest caddy v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

Wait. Are you still encountering hundreds of log entries per second right now?
If so, try deleting the mentioned (un)lock file(s) manually. That should fix it, at least temporarily.
In your case that would be:

rm /var/lib/caddy/locks/*

And posting the output of ls -l /var/lib/caddy/locks/ before executing the rm would be helpful too.
(But only if you are getting spammed right now.)

@mildred
Author

mildred commented Dec 3, 2021

Some lock files seem to contain JSON, some seem empty, and some seem to contain only zero bytes:

root@mail:/var/lib/caddy/locks# ls -la
total 60
drwx------  2 caddy caddy  4096 Dec  3 23:01 .
drwxr-xr-x+ 8 caddy caddy 12288 Dec  3 23:01 ..
-rw-r--r--  1 caddy caddy     0 Nov 10 08:52 issue_cert_autoconfig.mildred.fr.lock
-rw-r--r--  1 caddy caddy    98 Nov 25 11:02 issue_cert_autoconfig.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_autoconfig.test.webmsg.me.lock.unlock
-rw-r--r--  1 caddy caddy    98 Oct  1 18:45 issue_cert_autoconfig.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:02 issue_cert_dav.mildred.fr.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_dav.mildred.fr.lock.unlock
-rw-r--r--  1 caddy caddy     0 Nov  9 16:46 issue_cert_jmap.mildred.fr.lock
-rw-r--r--  1 caddy caddy    98 Dec  3 23:27 issue_cert_jmap.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy    98 Aug 12 09:23 issue_cert_jmap.webmsg.me.lock
-rw-r--r--  1 caddy caddy    98 Nov  9 16:37 issue_cert_mildred.fr.lock
-rw-r--r--  1 caddy caddy    98 Oct 18 19:47 issue_cert_mildred.webmsg.me.lock
-rw-r--r--  1 caddy caddy    98 Sep 30 19:49 issue_cert_roundcube.mildred.fr.lock
-rw-r--r--  1 caddy caddy    98 Dec  3 23:27 issue_cert_roundcube.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_test.webmsg.me.lock.unlock
-rw-r--r--  1 caddy caddy    98 Dec  3 23:27 issue_cert_webmail.mildred.fr.lock
-rw-r--r--  1 caddy caddy     0 May 28  2021 issue_cert_webmail.mildred.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_webmail.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_webmail.test.webmsg.me.lock.unlock
-rw-r--r--  1 caddy caddy    98 Oct  1 20:27 issue_cert_webmsg.me.lock
-rw-r--r--  1 caddy caddy    98 Oct 18 19:47 issue_cert_www.mildred.fr.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_www.test.webmsg.me.lock
-rw-r--r--  1 caddy caddy     0 Nov 25 11:03 issue_cert_www.test.webmsg.me.lock.unlock
-rw-r--r--  1 caddy caddy     0 Sep  6 09:04 issue_cert_www.webmsg.me.lock
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_autoconfig.test.webmsg.me.lock
00000000  7b 22 63 72 65 61 74 65  64 22 3a 22 32 30 32 31  |{"created":"2021|
00000010  2d 31 31 2d 32 35 54 31  30 3a 32 39 3a 30 38 2e  |-11-25T10:29:08.|
00000020  34 31 35 34 39 38 32 34  31 2b 30 31 3a 30 30 22  |415498241+01:00"|
00000030  2c 22 75 70 64 61 74 65  64 22 3a 22 32 30 32 31  |,"updated":"2021|
00000040  2d 31 31 2d 32 35 54 31  31 3a 30 32 3a 35 39 2e  |-11-25T11:02:59.|
00000050  38 31 38 37 37 32 34 34  31 2b 30 31 3a 30 30 22  |818772441+01:00"|
00000060  7d 0a                                             |}.|
00000062
root@mail:/var/lib/caddy/locks# file *
issue_cert_autoconfig.mildred.fr.lock:            empty
issue_cert_autoconfig.test.webmsg.me.lock:        JSON data
issue_cert_autoconfig.test.webmsg.me.lock.unlock: empty
issue_cert_autoconfig.webmsg.me.lock:             data
issue_cert_dav.mildred.fr.lock:                   empty
issue_cert_dav.mildred.fr.lock.unlock:            empty
issue_cert_jmap.mildred.fr.lock:                  empty
issue_cert_jmap.test.webmsg.me.lock:              JSON data
issue_cert_jmap.webmsg.me.lock:                   data
issue_cert_mildred.fr.lock:                       data
issue_cert_mildred.webmsg.me.lock:                data
issue_cert_roundcube.mildred.fr.lock:             data
issue_cert_roundcube.test.webmsg.me.lock:         JSON data
issue_cert_test.webmsg.me.lock:                   empty
issue_cert_test.webmsg.me.lock.unlock:            empty
issue_cert_webmail.mildred.fr.lock:               JSON data
issue_cert_webmail.mildred.webmsg.me.lock:        empty
issue_cert_webmail.test.webmsg.me.lock:           empty
issue_cert_webmail.test.webmsg.me.lock.unlock:    empty
issue_cert_webmsg.me.lock:                        data
issue_cert_www.mildred.fr.lock:                   data
issue_cert_www.test.webmsg.me.lock:               empty
issue_cert_www.test.webmsg.me.lock.unlock:        empty
issue_cert_www.webmsg.me.lock:                    empty
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_autoconfig.webmsg.me.lock
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                             |..|
00000062
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_jmap.webmsg.me.lock
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                             |..|
00000062
root@mail:/var/lib/caddy/locks# hexdump -C issue_cert_www.mildred.fr.lock
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00000060  00 00                                             |..|
00000062
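The "data" entries in the file(1) output above turn out to be all-NUL files. A small hypothetical helper (not an official tool) to flag every zero-filled lock file, assuming the default /var/lib/caddy storage path:

```shell
# List lock files that contain only NUL bytes: after stripping NULs,
# a zero-filled file is empty.
find_zero_filled_locks() {
	for f in "$1"/*.lock; do
		[ -s "$f" ] || continue               # skip empty/missing files
		if [ -z "$(tr -d '\000' <"$f")" ]; then
			echo "zero-filled: $f"
		fi
	done
}

find_zero_filled_locks /var/lib/caddy/locks
```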

@mildred
Author

mildred commented Dec 3, 2021

It could be that, because the logs filled up the disk, Caddy could not save the lock files properly, resulting in corrupted files.
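If that theory is right, an atomic-write pattern would avoid it: write to a temporary file and rename it into place only if the write succeeded, so a full disk never leaves a truncated or zero-filled lock behind. This is a generic sketch, not how certmagic actually writes its lock files; the path is illustrative:

```shell
# Write $2 into $1 atomically: a failed write never replaces the old file,
# because rename(2) is atomic on the same filesystem.
write_lock_atomically() {
	tmp=$(mktemp "$1.XXXXXX") || return 1
	if printf '%s\n' "$2" >"$tmp"; then
		mv "$tmp" "$1"
	else
		rm -f "$tmp"
		return 1
	fi
}

write_lock_atomically /tmp/example.lock '{"created":"...","updated":"..."}'
```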

@mholt
Copy link
Member

mholt commented Dec 4, 2021

Reopening because this discussion/investigation is interesting. Worth a second look.

@mholt mholt reopened this Dec 4, 2021
@mildred
Author

mildred commented Dec 4, 2021

Let's look at the time between when the system starts up and when Caddy starts generating those log lines (so I could perhaps make a small reproducer)...

  • Dec 3 23:00:53 mail kernel: [ 0.000000] Linux version 5.10.0-9-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.70-1 (2021-09-30) (around 57% of the file)
  • Dec 4 00:06:12 mail caddy[913]: 2021/12/04 00:00:51 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock

The server has rebooted since, and with the disk being full, it may not exhibit the same behaviour. I'll remove a log and restart. By the way, /var/log/syslog and /var/log/messages both contain the log lines above, but syslog is 11G and messages is only 6.5G

@mildred
Author

mildred commented Dec 4, 2021

The bug clearly does not happen every time. Since this morning Caddy is quiet.

Trying to run Caddy outside of its systemd service, I cannot reproduce the problem. However, if I give Caddy the autosaved JSON config, I get the error. The difference between those two files is exclusively in the domain list.

Edit: the difference in the domain list, which I suspect is only relevant for apps.http.tls.certificates.automate, is:

@@ -506,12 +578,18 @@
       "certificates": {
         "@id": "tls.certificates",
         "automate": [
-          "webmsg.me",
-          "www.webmsg.me",
           "bourderie-1.webmsg.me",
           "www.bourderie-1.webmsg.me",
+          "dav.bourderie-1.webmsg.me",
           "jmap.bourderie-1.webmsg.me",
-          "dav.bourderie-1.webmsg.me"
+          "mildred.fr",
+          "www.mildred.fr",
+          "dav.mildred.fr",
+          "jmap.mildred.fr",
+          "test.webmsg.me",
+          "www.test.webmsg.me",
+          "dav.test.webmsg.me",
+          "jmap.test.webmsg.me"
         ]
       }
     }

Running Caddy for some time and extracting the list of domains triggering the offending log line yields the same domain every time:

root@mail:/home/postmaster# /usr/local/bin/caddy run --watch --config autosave.json 2>&1 | grep 'is stale' | cut -d: -f9- | uniq
02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock

Looking at the lock, it's a legitimate JSON file:

root@mail:/home/postmaster# hexdump -C /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock
00000000  7b 22 63 72 65 61 74 65  64 22 3a 22 32 30 32 31  |{"created":"2021|
00000010  2d 31 31 2d 32 35 54 31  30 3a 32 39 3a 30 38 2e  |-11-25T10:29:08.|
00000020  34 31 35 34 39 38 32 34  31 2b 30 31 3a 30 30 22  |415498241+01:00"|
00000030  2c 22 75 70 64 61 74 65  64 22 3a 22 32 30 32 31  |,"updated":"2021|
00000040  2d 31 31 2d 32 35 54 31  31 3a 30 32 3a 35 39 2e  |-11-25T11:02:59.|
00000050  38 31 38 37 37 32 34 34  31 2b 30 31 3a 30 30 22  |818772441+01:00"|
00000060  7d 0a                                             |}.|
00000062

@mildred
Author

mildred commented Dec 4, 2021

Trying to come up with a minimal config file, removing:

  • admin API
  • layer4 plugin
  • most routes except a static_response
  • most listen addresses
  • hosts except offending host
  • apps.http.tls.certificates.automate, which does not contain the offending host

I came up with this minimal configuration:

{
  "apps": {
    "http": {
      "servers": {
        "master": {
          "listen": [
            "127.0.0.1:443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "body": "TEST",
                  "handler": "static_response",
                  "headers": {
                    "Content-Type": [
                      "text/plain; charset=utf-8"
                    ]
                  }
                }
              ],
              "match": [
                {
                  "host": [
                    "autoconfig.test.webmsg.me"
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  },
  "logging": {
    "logs": {
      "default": {
        "level": "INFO",
        "writer": {
          "output": "stdout"
        }
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "/var/lib/caddy"
  }
}

When I run Caddy with:

/usr/local/bin/caddy run --watch --config autosave-2.json

I get too many of:

2021/12/04 23:16:09 [INFO][FileStorage:/var/lib/caddy] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /var/lib/caddy/locks/issue_cert_autoconfig.test.webmsg.me.lock

@mildred
Author

mildred commented Dec 4, 2021

Got a minimal reproducer:

mkdir -p /tmp/caddy-bug/locks
touch /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock.unlock
echo '{"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}' >/tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock

config:

{
  "apps": {
    "http": {
      "servers": {
        "master": {
          "listen": [
            "127.0.0.1:443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "body": "TEST",
                  "handler": "static_response",
                  "headers": {
                    "Content-Type": [
                      "text/plain; charset=utf-8"
                    ]
                  }
                }
              ],
              "match": [
                {
                  "host": [
                    "autoconfig.test.webmsg.me"
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  },
  "logging": {
    "logs": {
      "default": {
        "level": "INFO",
        "writer": {
          "output": "stdout"
        }
      }
    }
  },
  "storage": {
    "module": "file_system",
    "root": "/tmp/caddy-bug"
  }
}

Start Caddy and trigger the bug:

/usr/local/bin/caddy run --watch --config config.json

@mildred
Author

mildred commented Dec 4, 2021

edit: there is a problem with my reproducer, disregard what follows

There is really something about this domain that causes the bug. It's an NXDOMAIN, but not all NXDOMAINs cause the bug.

postmaster@mail:~/bug$ sudo ./trigger-bug.sh 
+ CADDY=/home/postmaster/bug/caddy
+ CONFIG=/home/postmaster/bug/caddy-bug.json
+ CADDYHOME=/home/postmaster/bug/caddy-bug
+ DOMAIN=autoconfig.test.webmsg.me
+ mkdir -p /home/postmaster/bug/caddy-bug/locks
+ touch /home/postmaster/bug/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock.unlock
+ echo {"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}
+ [ -e /home/postmaster/bug/caddy ]
+ cat
+ /home/postmaster/bug/caddy run --watch --config /home/postmaster/bug/caddy-bug.json
2021/12/04 22:42:23.356	INFO	using provided configuration	{"config_file": "/home/postmaster/bug/caddy-bug.json", "config_adapter": ""}
2021/12/04 22:42:23.358	INFO	redirected default logger	{"from": "stderr", "to": "stdout"}
2021/12/04 22:42:23.361	INFO	admin	admin endpoint started	{"address": "tcp/localhost:2019", "enforce_origin": false, "origins": ["localhost:2019", "[::1]:2019", "127.0.0.1:2019"]}
2021/12/04 22:42:23.363	INFO	http	server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS	{"server_name": "master", "https_port": 443}
2021/12/04 22:42:23.363	INFO	http	enabling automatic HTTP->HTTPS redirects	{"server_name": "master"}
2021/12/04 22:42:23.363	INFO	tls.cache.maintenance	started background certificate maintenance	{"cache": "0xc00030fb90"}
2021/12/04 22:42:23.364	INFO	http	enabling automatic TLS certificate management	{"domains": ["autoconfig.test.webmsg.me"]}
2021/12/04 22:42:23.366	INFO	autosaved config (load with --resume flag)	{"file": "/root/.config/caddy/autosave.json"}
2021/12/04 22:42:23.366	INFO	serving initial configuration
2021/12/04 22:42:23.366	INFO	watcher	watching config file for changes	{"config_file": "/home/postmaster/bug/caddy-bug.json"}
2021/12/04 22:42:23.368	INFO	tls	cleaning storage unit	{"description": "FileStorage:/tmp/caddy-bug"}
2021/12/04 22:42:23.368	INFO	tls	finished cleaning storage units
2021/12/04 22:42:23.369	INFO	tls.obtain	acquiring lock	{"identifier": "autoconfig.test.webmsg.me"}
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock
2021/12/04 23:42:23 [INFO][FileStorage:/tmp/caddy-bug] Lock for 'issue_cert_autoconfig.test.webmsg.me' is stale (created: 2021-11-25 10:29:08.415498241 +0100 CET, last update: 2021-11-25 11:02:59.818772441 +0100 CET); removing then retrying: /tmp/caddy-bug/locks/issue_cert_autoconfig.test.webmsg.me.lock

With another NXDOMAIN:

postmaster@mail:~/bug$ sudo ./trigger-bug.sh 
+ CADDY=/home/postmaster/bug/caddy
+ CONFIG=/home/postmaster/bug/caddy-bug.json
+ CADDYHOME=/home/postmaster/bug/caddy-bug
+ DOMAIN=nxdomain.test.webmsg.me
+ mkdir -p /home/postmaster/bug/caddy-bug/locks
+ touch /home/postmaster/bug/caddy-bug/locks/issue_cert_nxdomain.test.webmsg.me.lock.unlock
+ echo {"created":"2021-11-25T10:29:08.415498241+01:00","updated":"2021-11-25T11:02:59.818772441+01:00"}
+ [ -e /home/postmaster/bug/caddy ]
+ cat
+ /home/postmaster/bug/caddy run --watch --config /home/postmaster/bug/caddy-bug.json
2021/12/04 22:42:50.845	INFO	using provided configuration	{"config_file": "/home/postmaster/bug/caddy-bug.json", "config_adapter": ""}
2021/12/04 22:42:50.847	INFO	redirected default logger	{"from": "stderr", "to": "stdout"}
2021/12/04 22:42:50.869	INFO	admin	admin endpoint started	{"address": "tcp/localhost:2019", "enforce_origin": false, "origins": ["[::1]:2019", "127.0.0.1:2019", "localhost:2019"]}
2021/12/04 22:42:50.870	INFO	tls.cache.maintenance	started background certificate maintenance	{"cache": "0xc0002b8000"}
2021/12/04 22:42:50.871	INFO	http	server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS	{"server_name": "master", "https_port": 443}
2021/12/04 22:42:50.871	INFO	http	enabling automatic HTTP->HTTPS redirects	{"server_name": "master"}
2021/12/04 22:42:50.872	INFO	tls	cleaning storage unit	{"description": "FileStorage:/tmp/caddy-bug"}
2021/12/04 22:42:50.872	INFO	tls	finished cleaning storage units
2021/12/04 22:42:50.873	INFO	http	enabling automatic TLS certificate management	{"domains": ["nxdomain.test.webmsg.me"]}
2021/12/04 22:42:50.875	INFO	tls.obtain	acquiring lock	{"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:42:50.876	INFO	autosaved config (load with --resume flag)	{"file": "/root/.config/caddy/autosave.json"}
2021/12/04 22:42:50.876	INFO	serving initial configuration
2021/12/04 22:42:50.876	INFO	watcher	watching config file for changes	{"config_file": "/home/postmaster/bug/caddy-bug.json"}
2021/12/04 22:42:51.418	INFO	tls.obtain	lock acquired	{"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:42:51.426	INFO	tls.issuance.acme	waiting on internal rate limiter	{"identifiers": ["nxdomain.test.webmsg.me"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": ""}
2021/12/04 22:42:51.427	INFO	tls.issuance.acme	done waiting on internal rate limiter	{"identifiers": ["nxdomain.test.webmsg.me"], "ca": "https://acme-v02.api.letsencrypt.org/directory", "account": ""}
2021/12/04 22:42:54.259	INFO	tls.issuance.acme.acme_client	trying to solve challenge	{"identifier": "nxdomain.test.webmsg.me", "challenge_type": "http-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
2021/12/04 22:42:55.346	ERROR	tls.issuance.acme.acme_client	challenge failed	{"identifier": "nxdomain.test.webmsg.me", "challenge_type": "http-01", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}}
2021/12/04 22:42:55.348	ERROR	tls.issuance.acme.acme_client	validating authorization	{"identifier": "nxdomain.test.webmsg.me", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}, "order": "https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777106950", "attempt": 1, "max_attempts": 3}
2021/12/04 22:42:57.050	INFO	tls.issuance.acme.acme_client	trying to solve challenge	{"identifier": "nxdomain.test.webmsg.me", "challenge_type": "tls-alpn-01", "ca": "https://acme-v02.api.letsencrypt.org/directory"}
2021/12/04 22:42:57.761	ERROR	tls.issuance.acme.acme_client	challenge failed	{"identifier": "nxdomain.test.webmsg.me", "challenge_type": "tls-alpn-01", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}}
2021/12/04 22:42:57.765	ERROR	tls.issuance.acme.acme_client	validating authorization	{"identifier": "nxdomain.test.webmsg.me", "problem": {"type": "urn:ietf:params:acme:error:dns", "title": "", "detail": "DNS problem: NXDOMAIN looking up A for nxdomain.test.webmsg.me - check that a DNS record exists for this domain", "instance": "", "subproblems": []}, "order": "https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777112450", "attempt": 2, "max_attempts": 3}
2021/12/04 22:43:00.356	ERROR	tls.obtain	could not get certificate from issuer	{"identifier": "nxdomain.test.webmsg.me", "issuer": "acme-v02.api.letsencrypt.org-directory", "error": "[nxdomain.test.webmsg.me] solving challenges: nxdomain.test.webmsg.me: no solvers available for remaining challenges (configured=[http-01 tls-alpn-01] offered=[http-01 dns-01 tls-alpn-01] remaining=[dns-01]) (order=https://acme-v02.api.letsencrypt.org/acme/order/307839760/44777117970) (ca=https://acme-v02.api.letsencrypt.org/directory)"}
2021/12/04 22:43:00.360	WARN	tls.issuance.zerossl	missing email address for ZeroSSL; it is strongly recommended to set one for next time
^C2021/12/04 22:43:00.411	INFO	shutting down	{"signal": "SIGINT"}
2021/12/04 22:43:00.412	WARN	exiting; byeee!! 👋	{"signal": "SIGINT"}
2021/12/04 22:43:00.417	INFO	tls.cache.maintenance	stopped background certificate maintenance	{"cache": "0xc0002b8000"}
2021/12/04 22:43:00.418	ERROR	tls.obtain	could not get certificate from issuer	{"identifier": "nxdomain.test.webmsg.me", "issuer": "acme.zerossl.com-v2-DV90", "error": "account pre-registration callback: performing EAB credentials request: Post \"https://api.zerossl.com/acme/eab-credentials-email\": context canceled"}
2021/12/04 22:43:00.420	ERROR	tls.obtain	will retry	{"error": "[nxdomain.test.webmsg.me] Obtain: account pre-registration callback: performing EAB credentials request: Post \"https://api.zerossl.com/acme/eab-credentials-email\": context canceled", "attempt": 1, "retrying_in": 60, "elapsed": 8.999581352, "max_duration": 2592000}
2021/12/04 22:43:00.420	INFO	tls.obtain	releasing lock	{"identifier": "nxdomain.test.webmsg.me"}
2021/12/04 22:43:00.421	INFO	admin	stopped previous server	{"address": "tcp/localhost:2019"}
2021/12/04 22:43:00.421	INFO	shutdown complete	{"signal": "SIGINT", "exit_code": 0}

@mildred
Copy link
Author

mildred commented Dec 4, 2021

Got an easy reproducer, see first comment.

I believe I ran out of disk space at some point, which caused the bad lock state. I have since increased the VM disk size, but the bad state remained in the lock files.

edit: The reason it did not happen every time is that the offending domain name is added later via the admin API by an external script that needs to run, or when Caddy restores a config containing the offending domain.

@francislavoie
Copy link
Member

francislavoie commented Dec 5, 2021

Okay so it seems to me like there's an infinite loop that needs fixing.

I think it's in certmagic, not Caddy, actually. If you go searching for that error message in the code, you should be able to find it. (I'm away for the weekend, or I'd take a closer look myself right now)

@mildred
Copy link
Author

mildred commented Dec 5, 2021

I didn't find the message in Caddy's code; it's probably in a dependency, yes

@mholt
Copy link
Member

mholt commented Dec 13, 2021

Thanks for the added details, I will circle back around to this when I get a chance! (or someone else can beat me to it, of course)

@mholt mholt added bug 🐞 Something isn't working help wanted 🆘 Extra attention is needed upstream ⬆️ Relates to some dependency of this project and removed needs info 📭 Requires more information labels Dec 13, 2021
@mholt
Copy link
Member

mholt commented Jan 5, 2022

We aren't checking the error from removeLockFile() here after that log message is emitted:

https://github.com/caddyserver/certmagic/blob/9b2c6ac88468351d4cd3aa9b3f378b6da7be06e4/filestorage.go#L163-L168

So it's possible that the removal is failing and causing an infinite loop.

However, the removeLockFile() function creates those .unlock files you reported earlier. If one of those is left hanging around, we have another problem: the marker signals that something else is currently deleting the lock, so we just return right away and iterate the loop again. That is probably what causes the infinite loop in this case.

It looks like in the original implementation of this atomic unlocking, the point of making it atomic was to clean up empty dirs after removing a lock, but the caller now does that in a separate function, CleanStorage(). So I wonder if we even need atomic unlocking anymore. We might be able to simply do os.Remove() and be OK 99.99~% of the time.

I do not know of a way to do an atomic remove that doesn't involve creating more state that can be interrupted, as was the case here. It's rare, yes; but a lingering .unlock file is just as bad as a lingering .lock file...

Basically, I'm thinking it might be best to just change the "atomic unlock" to a regular os.Remove() call. Yes, this means that if two instances see the stale lock at the exact same time, and one is able to delete it AND create the new one before the other one deletes it, then the new lock will be deleted. I view this as probably even rarer than the original bug, and while it would result in two instances doing the same work at the same time, that's less detrimental for certificates than an infinite loop is.

Thoughts?

@mholt
Copy link
Member

mholt commented Jan 6, 2022

I went ahead and committed a fix that I hope will work. @mildred can you give it a try? The mutual exclusion guarantees are weaker now in the presence of stale locks, but I suspect races on those will be rarer than the infinite looping you were experiencing (and less problematic too).

@mholt mholt removed the help wanted 🆘 Extra attention is needed label Jan 6, 2022
crccw pushed a commit to crccw/certmagic that referenced this issue Feb 23, 2022
Should fix caddyserver/caddy#4448

Weaker mutual exclusion guarantees, but probably the better alternative