Attempt to get length of local 'certs' (a nil value) #48

Closed
gjongenelen opened this issue Feb 23, 2017 · 8 comments
@gjongenelen
Contributor

Hi,

I've been using auto-ssl for a while now, but I regularly run into an issue.

Once in a while this error comes up, which then results in handshake errors. Restarting the nginx instances solves the problem. I'm using Redis as storage.

2017/02/23 07:45:01 [error] 25#0: *3177 lua entry thread aborted: runtime error: /usr/local/openresty/lualib/ngx/ocsp.lua:54: attempt to get length of local 'certs' (a nil value)
stack traceback:
coroutine 0:
    /usr/local/openresty/lualib/ngx/ocsp.lua: in function 'get_ocsp_responder_from_der_chain'
    .../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:123: in function 'get_ocsp_response'
    .../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:180: in function 'set_ocsp_stapling'
    .../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:210: in function 'set_cert'
    .../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:252: in function 'ssl_certificate'
    .../local/openresty/luajit/share/lua/5.1/resty/auto-ssl.lua:70: in function 'ssl_certificate'
    ssl_certificate_by_lua:2: in function <ssl_certificate_by_lua:1>, context: ssl_certificate_by_lua*, client: 10.42.205.225, server: 0.0.0.0:443
2017/02/23 07:45:01 [crit] 25#0: *3176 SSL_do_handshake() failed (SSL: error:1408A179:SSL routines:ssl3_get_client_hello:cert cb error) while SSL handshaking, client: 10.42.205.225, server: 0.0.0.0:443

What could be going wrong here?

Gino

@GUI
Collaborator

GUI commented Feb 25, 2017

So in tracing through the code, I think this should have only been possible if OpenResty's built-in ssl.cert_pem_to_der or ssl.priv_key_pem_to_der functions failed for some reason. If those failed, lua-resty-auto-ssl wasn't properly returning an error, so it would try to proceed with OCSP stapling, which leads to this nil value error.
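As a minimal illustration (this is not the project's actual code, and the helper name is made up), the failure mode looks roughly like this: if the return value of the DER conversion isn't checked, a nil chain gets passed straight to the OCSP helper, which is where the "attempt to get length of local 'certs' (a nil value)" error is raised.

local ssl = require "ngx.ssl"
local ocsp = require "ngx.ocsp"

-- Hypothetical helper sketching the guard described above.
local function der_chain_and_ocsp_url(fullchain_pem)
  local der_cert, der_err = ssl.cert_pem_to_der(fullchain_pem)
  if not der_cert then
    -- Without this check, der_cert would be nil and
    -- ocsp.get_ocsp_responder_from_der_chain(nil) raises the error above.
    ngx.log(ngx.ERR, "auto-ssl: failed to convert certificate chain to DER: ", der_err)
    return nil, der_err
  end

  local ocsp_url, ocsp_err = ocsp.get_ocsp_responder_from_der_chain(der_cert)
  if not ocsp_url then
    ngx.log(ngx.WARN, "auto-ssl: no OCSP responder found in chain: ", ocsp_err)
  end

  return der_cert, ocsp_url
end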

I've fixed lua-resty-auto-ssl so that it will log and return a proper error when those built-in functions fail (035dcba), and that update is published in the v0.10.4 release. I've also included some fixes in v0.10.4 that should hopefully handle unexpected errors like this more gracefully (so you might not have to fully restart things if an unexpected error like this crops up).

That being said, I'm still not exactly sure why ssl.cert_pem_to_der or ssl.priv_key_pem_to_der would fail (I've never seen that before). So while v0.10.4 should hopefully help debug the underlying issue (with better error logging), I'm not sure it will solve it. For those built-in functions to even be called, there has to be certificate data returned from storage (or a newly issued certificate with data), but then those DER functions fail for some reason. The fact that it works again after restarting nginx is curious, since that would seem to indicate the data stored in Redis is valid. So unless Redis is sometimes returning a partial or corrupted response, this is all a bit odd.

So could you try updating to v0.10.4 and then we can see what the logs say when those built-in functions fail? Sorry for the trouble and hassle, but hopefully with v0.10.4's logging we can get to the bottom of this. If you get better error logs from v0.10.4, could you also provide which version of OpenResty and OpenSSL you're running, and whether this seems to be happening for existing certs, or just when certs are renewed? Thanks!

@domharrington

domharrington commented Mar 23, 2017

I think I'm experiencing a similar error to this. Here are my logs:

{"log":"2017/03/23 10:20:16 [notice] 56#56: *1071875 [lua] ssl_certificate.lua:85: issue_cert(): auto-ssl: issuing new certificate for example.com, context: ssl_certificate_by_lua*, client: 172.31.2.109, server: 0.0.0.0:443\n","stream":"stderr","time":"2017-03-23T10:20:16.473051893Z"}
{"log":"2017/03/23 10:20:16 [error] 56#56: *1071875 lua entry thread aborted: runtime error: ...re/lua/5.1/resty/auto-ssl/ssl_providers/lets_encrypt.lua:21: attempt to concatenate a nil value\n","stream":"stderr","time":"2017-03-23T10:20:16.473057068Z"}
{"log":"stack traceback:\n","stream":"stderr","time":"2017-03-23T10:20:16.473060427Z"}
{"log":"coroutine 0:\n","stream":"stderr","time":"2017-03-23T10:20:16.473064263Z"}
{"log":"\u0009...re/lua/5.1/resty/auto-ssl/ssl_providers/lets_encrypt.lua: in function 'issue_cert'\n","stream":"stderr","time":"2017-03-23T10:20:16.473067343Z"}
{"log":"\u0009.../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:86: in function 'issue_cert'\n","stream":"stderr","time":"2017-03-23T10:20:16.473070704Z"}
{"log":"\u0009.../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:112: in function 'get_cert'\n","stream":"stderr","time":"2017-03-23T10:20:16.47307377Z"}
{"log":"\u0009.../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:245: in function 'ssl_certificate'\n","stream":"stderr","time":"2017-03-23T10:20:16.47307742Z"}
{"log":"\u0009.../local/openresty/luajit/share/lua/5.1/resty/auto-ssl.lua:70: in function 'ssl_certificate'\n","stream":"stderr","time":"2017-03-23T10:20:16.473080429Z"}
{"log":"\u0009ssl_certificate_by_lua:2: in function \u003cssl_certificate_by_lua:1\u003e, context: ssl_certificate_by_lua*, client: 172.31.2.109, server: 0.0.0.0:443\n","stream":"stderr","time":"2017-03-23T10:20:16.473083802Z"}
{"log":"2017/03/23 10:20:16 [error] 56#56: *1071875 lua entry thread aborted: runtime error: ...re/lua/5.1/resty/auto-ssl/ssl_providers/lets_encrypt.lua:21: attempt to concatenate a nil value\n","stream":"stderr","time":"2017-03-23T10:20:16.47308708Z"}
{"log":"2017/03/23 10:20:16 [crit] 56#56: *1071874 SSL_do_handshake() failed (SSL: error:1408A179:SSL routines:ssl3_get_client_hello:cert cb error) while SSL handshaking, client: 172.31.2.109, server: 0.0.0.0:443\n","stream":"stderr","time":"2017-03-23T10:20:16.473152026Z"}

About 2 minutes after this happens, the docker container becomes totally unresponsive and only a restart fixes it. I'm on v0.10.3-1 of this plugin.

@domharrington

domharrington commented Apr 12, 2017

Hey,

I've upgraded to the latest version and now I get the following error in my logs:

*23036 [lua] ssl_certificate.lua:265: ssl_certificate(): auto-ssl: failed to run do_ssl: ...re/lua/5.1/resty/auto-ssl/ssl_providers/lets_encrypt.lua:15: attempt to concatenate a nil value, context: ssl_certificate_by_lua*

This seems to happen around the time when new certificates are generated.

Edit: this appears to be coming from this file: https://github.com/GUI/lua-resty-auto-ssl/blob/f7fe3e62c081d77c7d021fcc10371ebb26c97bc0/lib/resty/auto-ssl/ssl_providers/lets_encrypt.lua#L15. Maybe hook_port is nil somehow? I can't see how that would be the case, as the asserts above should cover this: https://github.com/GUI/lua-resty-auto-ssl/blob/f7fe3e62c081d77c7d021fcc10371ebb26c97bc0/lib/resty/auto-ssl/ssl_providers/lets_encrypt.lua#L10-L11

Edit 2: I'm also not overriding hook_server_port, so this should always be set to the default: https://github.com/GUI/lua-resty-auto-ssl/blob/e994c8cc5d00100401dbd0138180bd368b80d34c/lib/resty/auto-ssl.lua#L43-L45

GUI added a commit that referenced this issue Apr 17, 2017
If the hook secret was nil, then it was leading to errors, but the
backtraces were a bit odd, since they weren't reporting the right line
number: #48

Still need to figure out a better way to ensure hook_secret doesn't
disappear if the shared dict gets full, but this at least improves the
errors to make things easier to understand.
GUI added a commit that referenced this issue Apr 17, 2017
If the shared dict ran out of memory, then the hook secret could end up
getting removed, leading to errors like this:
#48 (comment)

By storing the hook secret as a global variable, it should no longer be
subject to getting removed from the shared dict.
GUI added a commit that referenced this issue Apr 17, 2017
This should more generally help with potential issues seen in
#48 where old items
might get removed from the shdict if there are too many certificates for
the configured amount of memory. By logging this as an error, it will
hopefully make debugging and alerting to the real issue easier (need to
increase shdict memory).

Since we don't store duplicate certs for domains, and the certificate
data is stored with a TTL, I think this should be okay to log at the
error level: if this happens, things aren't going well, because there's
not enough memory for the active domains.
@GUI
Collaborator

GUI commented Apr 17, 2017

@domharrington: Thanks for reporting this and for the details! I think I was able to track down what was happening. The line number on the error backtrace was a bit off for some reason, but the culprit was hook_secret being nil. The only way I think this could happen is if the shdict used for caching certs data in memory ran out of space. I think this in turn could lead to things becoming unresponsive if new certificates were being registered (because without the secret, the internal communication becomes rather broken).
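To illustrate the theory (this is not auto-ssl's actual code, and the key name is made up): a value written to a lua_shared_dict at startup can later be silently evicted when the dict fills up, so a subsequent get() returns nil.

local dict = ngx.shared.auto_ssl

-- Stored once when the hook server is set up.
dict:set("hook_server:secret", "some-random-secret")

-- Much later, after many certificates have been cached and the dict has
-- filled up, the secret may have been evicted to make room:
local secret = dict:get("hook_server:secret")
if not secret then
  -- Using this nil value (for example in a string concatenation while
  -- building the hook request) surfaces as "attempt to concatenate a nil
  -- value" in lets_encrypt.lua.
  ngx.log(ngx.ERR, "auto-ssl: hook secret missing from the shared dict")
end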

So if this theory is correct, you might need to increase the configured size of lua_shared_dict auto_ssl in your nginx config. I estimated that for each 1MB you can store roughly 100 certificates, but that estimate might be off. Does this seem like it could be the issue in your setup, or do you not have enough certificates for that to make sense?
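For reference, the relevant directive looks something like the following (the 8m value is purely illustrative, derived from the rough 100-certificates-per-1MB estimate above, not a project recommendation):

http {
  # lua-resty-auto-ssl caches certificate data in this shared dict. With a
  # few hundred certificates, a 1m dict is likely too small.
  lua_shared_dict auto_ssl 8m;

  # ... rest of the usual auto-ssl http configuration ...
}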

In any case, I've released v0.10.6, which I'm hoping should fix this in a couple different ways:

  • The hook secret is stored in a different way, so it shouldn't ever be subject to getting evicted from the cache, like it was before.
  • If the lua_shared_dict used for caching data begins to run out of memory, more explicit errors are logged to nginx's error log, suggesting that you raise the configured amount (but thanks to the first change the server should still be operational when this happens, just less efficient).

I'm hoping those updates in v0.10.6 should make things less fatal and easier to track down if your configured amount of memory is too low. This might also explain some other unresponsive issues a couple other people reported, but we were never able to get to the bottom of, so thanks again for the report.

Caveat: I'm not entirely certain this error is related to the original issue @gjongenelen reported.

@GUI GUI mentioned this issue Apr 17, 2017
@domharrington

Hey @GUI thank you very much for investigating this and coming up with a potential solution.

We currently have lua_shared_dict set to 1m, and we have a few hundred certs. I will try bumping this limit now and hopefully that will solve the problem!

Thank you very much again for taking the time to look into this. I will upgrade to 0.10.6 now.

Dom

@luto
Collaborator

luto commented May 31, 2017

@domharrington did you experience any more issues after bumping the limit and upgrading to the latest version?

@domharrington

@luto nope. It's been very stable since deploying the latest update.

@GUI
Collaborator

GUI commented Jun 1, 2017

Great, thank you for confirming! Going to go ahead and close this then.

@GUI GUI closed this as completed Jun 1, 2017
GUI added a commit that referenced this issue Jun 18, 2017
If the `auto_ssl` shared dict ran out of memory and nginx was then
reloaded, a race condition existed where multiple "sockproc" processes
could try to start at the same time.

While I think this situation would be unlikely to affect a production
system in a negative way (since the race condition only occurs during
nginx reloads and one of the sockproc processes should still succeed
and allow things to work), it did lead to some errors being logged,
which would intermittently cause our tests to fail (it would crop up on
the test following our t/memory.t test, since that next test would be
the first reload following our test to explicitly exhaust the memory).

This is fixed by using the `auto_ssl_settings` shared dict for storing
the resty-lock details (the lock prevents multiple processes from being
started at once). This smaller shared dict (introduced in #68) is used
for storing bits of data that won't grow in size so we can better ensure
the data will never be evicted from the cache. I'm now able to
repeatedly run the test suite in a loop without hitting this edge case.

Note that we are still using the `auto_ssl` shared dict for storing
resty-lock details for domain registrations, since the memory
requirements for that may grow (since there's a lock per domain, it's
dynamic in size). But that should be okay, because similar to the SSL
certs stored in `auto_ssl`, we're okay with cache evictions for old data
in those cases (along with warnings being logged).

Also possibly relevant is that currently resty-lock always uses `add`
for the shared dict (so it will evict old data to make room if
necessary), but there's a pull request for allowing use of `safe_add`:
openresty/lua-resty-lock#6. While we should be
okay by switching things to `auto_ssl_settings` (since we should never
have enough stored data in there to need evicting old data), it's
something to keep an eye on.

Related to:
- #68 (comment)
- #48 (comment)
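For context, here is a minimal sketch of the locking pattern this commit describes (not the project's exact code; start_sockproc is a hypothetical stand-in for however the daemon actually gets launched): the resty-lock is backed by the small `auto_ssl_settings` dict, whose entries shouldn't be at risk of eviction, so only one worker starts sockproc during a reload.

local resty_lock = require "resty.lock"

local function start_sockproc_once()
  -- Back the lock with the small, fixed-size dict so the lock key itself
  -- can't be evicted the way data in the larger auto_ssl dict can be.
  local lock, new_err = resty_lock:new("auto_ssl_settings", { exptime = 30 })
  if not lock then
    return nil, "failed to create lock: " .. (new_err or "")
  end

  local _, lock_err = lock:lock("sockproc_starting")
  if lock_err then
    return nil, "failed to obtain lock: " .. lock_err
  end

  -- start_sockproc() is a hypothetical stand-in; only the worker holding
  -- the lock runs it.
  local ok, start_err = start_sockproc()

  lock:unlock()
  return ok, start_err
end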