Skip to content

Openresty process becoming unresponsive. Possibly lock related? #136

@domharrington

Description

@domharrington

Hey! Sorry to bother you again, I think the instability that I had the other day (#135) was a red herring and not actually the cause of the outages that we've been having.

I've been seeing some things in our logs that look like the following:

Restarting selected processes ['openresty']

This is coming from supervisor detecting that openresty has become unresponsive. Scrolling to those logs, i see lots of things regarding a "shared memory zone" (i've redacted some duplicates for brevity):

[notice] 24883#24883: signal 15 (SIGTERM) received, exiting
[notice] 24883#24883: signal 14 (SIGALRM) received
[notice] 24883#24883: signal 17 (SIGCHLD) received
[alert] 24883#24883: worker process 24895 exited on signal 9
[alert] 24883#24883: shared memory zone "auto_ssl" was locked by 24895
[notice] 24883#24883: exit
[notice] 31640#31640: using the "epoll" event method
[notice] 31640#31640: openresty/1.11.2.2
[notice] 31640#31640: built by gcc 4.9.2 (Debian 4.9.2-10)
[notice] 31640#31640: OS: Linux 4.4.0-96-generic
[notice] 31640#31640: getrlimit(RLIMIT_NOFILE): 1048576:1048576
[notice] 31640#31640: start worker processes
[notice] 31640#31640: start worker process 31652
[notice] 31652#31652: signal 17 (SIGCHLD) received
[notice] 31652#31652: unknown process 31653 exited with code 0
[notice] 31652#31652: signal 17 (SIGCHLD) received
[notice] 31652#31652: unknown process 31655 exited with code 0
[notice] 31652#31652: *1 [lua] start_sockproc.lua:10: auto-ssl: starting sockproc, context: init_worker_by_lua*
[notice] 31652#31652: signal 17 (SIGCHLD) received
[info] 31652#31652: waitpid() failed (10: No child processes)

This log output lead me to the following issues:
#43
18F/api.data.gov#325

And a commit which fixed the locking issue: 8bcb9ee

One of the error messages in that commit "failed to unlock:" is also coming up in our logs with the following error message:

[error] 3706#3706: *20191 [lua] ssl_certificate.lua:53: issue_cert_unlock(): auto-ssl: failed to unlock: lock does not match expected value, context: ssl_certificate_by_lua*

Which appears to come from: https://github.com/GUI/lua-resty-auto-ssl/blob/0fe6ab67f091d75821a865d8d71122dc04e8ff0d/lib/resty/auto-ssl/ssl_certificate.lua#L53

Have you ever seen this before? Is there anything I can do to debug this further? I've already bumped the auto_ssl lua_shared_dict to 100mb in an attempt to mitigate in the chance we were filling it up.

Tracing through the code to find places that may be calling this, I can see that I have some of these errors too: https://github.com/GUI/lua-resty-auto-ssl/blob/0fe6ab67f091d75821a865d8d71122dc04e8ff0d/lib/resty/auto-ssl/ssl_certificate.lua#L77

With timeout being the reason the lock couldnt be created:

[error] 1850#1850: *576 [lua] ssl_certificate.lua:68: issue_cert(): auto-ssl: failed to obtain lock: timeout, context: ssl_certificate_by_lua*

We're using redis as our storage adapter, if that matters? Thanks again for all of your help with this. If there's anything I can do to contribute back to the code for all that you've done, i'd be more than happy to.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions