-
Couldn't load subscription status.
- Fork 184
Description
Hey! Sorry to bother you again, I think the instability that I had the other day (#135) was a red herring and not actually the cause of the outages that we've been having.
I've been seeing some things in our logs that look like the following:
Restarting selected processes ['openresty']
This is coming from supervisor detecting that openresty has become unresponsive. Scrolling to those logs, i see lots of things regarding a "shared memory zone" (i've redacted some duplicates for brevity):
[notice] 24883#24883: signal 15 (SIGTERM) received, exiting
[notice] 24883#24883: signal 14 (SIGALRM) received
[notice] 24883#24883: signal 17 (SIGCHLD) received
[alert] 24883#24883: worker process 24895 exited on signal 9
[alert] 24883#24883: shared memory zone "auto_ssl" was locked by 24895
[notice] 24883#24883: exit
[notice] 31640#31640: using the "epoll" event method
[notice] 31640#31640: openresty/1.11.2.2
[notice] 31640#31640: built by gcc 4.9.2 (Debian 4.9.2-10)
[notice] 31640#31640: OS: Linux 4.4.0-96-generic
[notice] 31640#31640: getrlimit(RLIMIT_NOFILE): 1048576:1048576
[notice] 31640#31640: start worker processes
[notice] 31640#31640: start worker process 31652
[notice] 31652#31652: signal 17 (SIGCHLD) received
[notice] 31652#31652: unknown process 31653 exited with code 0
[notice] 31652#31652: signal 17 (SIGCHLD) received
[notice] 31652#31652: unknown process 31655 exited with code 0
[notice] 31652#31652: *1 [lua] start_sockproc.lua:10: auto-ssl: starting sockproc, context: init_worker_by_lua*
[notice] 31652#31652: signal 17 (SIGCHLD) received
[info] 31652#31652: waitpid() failed (10: No child processes)
This log output lead me to the following issues:
#43
18F/api.data.gov#325
And a commit which fixed the locking issue: 8bcb9ee
One of the error messages in that commit "failed to unlock:" is also coming up in our logs with the following error message:
[error] 3706#3706: *20191 [lua] ssl_certificate.lua:53: issue_cert_unlock(): auto-ssl: failed to unlock: lock does not match expected value, context: ssl_certificate_by_lua*
Which appears to come from: https://github.com/GUI/lua-resty-auto-ssl/blob/0fe6ab67f091d75821a865d8d71122dc04e8ff0d/lib/resty/auto-ssl/ssl_certificate.lua#L53
Have you ever seen this before? Is there anything I can do to debug this further? I've already bumped the auto_ssl lua_shared_dict to 100mb in an attempt to mitigate in the chance we were filling it up.
Tracing through the code to find places that may be calling this, I can see that I have some of these errors too: https://github.com/GUI/lua-resty-auto-ssl/blob/0fe6ab67f091d75821a865d8d71122dc04e8ff0d/lib/resty/auto-ssl/ssl_certificate.lua#L77
With timeout being the reason the lock couldnt be created:
[error] 1850#1850: *576 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*
We're using redis as our storage adapter, if that matters? Thanks again for all of your help with this. If there's anything I can do to contribute back to the code for all that you've done, i'd be more than happy to.