Attempt to get length of local 'certs' (a nil value) #48

Closed
gjongenelen opened this issue Feb 23, 2017 · 8 comments
@gjongenelen
Contributor

Hi,

I've been using auto-ssl for a while now, but I regularly run into an issue.

Once in a while this error comes up, which then results in handshake errors. Restarting the nginx instances solves the problem. I'm using Redis as storage.

2017/02/23 07:45:01 [error] 25#0: *3177 lua entry thread aborted: runtime error: /usr/local/openresty/lualib/ngx/ocsp.lua:54: attempt to get length of local 'certs' (a nil value)
stack traceback:
coroutine 0:
    /usr/local/openresty/lualib/ngx/ocsp.lua: in function 'get_ocsp_responder_from_der_chain'
    .../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:123: in function 'get_ocsp_response'
    .../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:180: in function 'set_ocsp_stapling'
    .../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:210: in function 'set_cert'
    .../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:252: in function 'ssl_certificate'
    .../local/openresty/luajit/share/lua/5.1/resty/auto-ssl.lua:70: in function 'ssl_certificate'
    ssl_certificate_by_lua:2: in function <ssl_certificate_by_lua:1>, context: ssl_certificate_by_lua*, client: 10.42.205.225, server: 0.0.0.0:443
2017/02/23 07:45:01 [crit] 25#0: *3176 SSL_do_handshake() failed (SSL: error:1408A179:SSL routines:ssl3_get_client_hello:cert cb error) while SSL handshaking, client: 10.42.205.225, server: 0.0.0.0:443

What could be going wrong here?

Gino

@GUI
Collaborator

GUI commented Feb 25, 2017

So in tracing through the code, I think this should have only been possible if OpenResty's built-in ssl.cert_pem_to_der or ssl.priv_key_pem_to_der functions failed for some reason. If those failed, lua-resty-auto-ssl wasn't properly returning an error, so it would try to proceed with OCSP stapling, which leads to this nil value error.
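As a minimal illustration (this is not the project's actual code, and the helper name is made up), the failure mode looks roughly like this: if the return value of the DER conversion isn't checked, a nil chain gets passed straight to the OCSP helper, which is where the "attempt to get length of local 'certs' (a nil value)" error is raised.

local ssl = require "ngx.ssl"
local ocsp = require "ngx.ocsp"

-- Hypothetical helper sketching the guard described above.
local function der_chain_and_ocsp_url(fullchain_pem)
  local der_cert, der_err = ssl.cert_pem_to_der(fullchain_pem)
  if not der_cert then
    -- Without this check, der_cert would be nil and
    -- ocsp.get_ocsp_responder_from_der_chain(nil) raises the error above.
    ngx.log(ngx.ERR, "auto-ssl: failed to convert certificate chain to DER: ", der_err)
    return nil, der_err
  end

  local ocsp_url, ocsp_err = ocsp.get_ocsp_responder_from_der_chain(der_cert)
  if not ocsp_url then
    ngx.log(ngx.WARN, "auto-ssl: no OCSP responder found in chain: ", ocsp_err)
  end

  return der_cert, ocsp_url
end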

I've fixed lua-resty-auto-ssl so that it will log and return a proper error when those built-in functions fail (035dcba), and that update is published in the v0.10.4 release. I've also included some fixes in v0.10.4 that should hopefully handle unexpected errors like this more gracefully (so you might not have to fully restart things if an unexpected error like this crops up).

That being said, I'm still not exactly sure why ssl.cert_pem_to_der or ssl.priv_key_pem_to_der would fail (I've never seen that before). So while v0.10.4 should hopefully help debug the underlying issue (with better error logging), I'm not sure it will solve it. For those built-in functions to even be called, there has to be certificate data returned from storage (or a newly issued certificate with data), but then those DER functions fail for some reason. The fact that it works again after restarting nginx is curious, since that would seem to indicate the data stored in Redis is valid. So unless Redis is sometimes returning a partial or corrupted response, this is all a bit odd.

So could you try updating to v0.10.4 and then we can see what the logs say when those built-in functions fail? Sorry for the trouble and hassle, but hopefully with v0.10.4's logging we can get to the bottom of this. If you get better error logs from v0.10.4, could you also provide which version of OpenResty and OpenSSL you're running, and whether this seems to be happening for existing certs, or just when certs are renewed? Thanks!

@domharrington

domharrington commented Mar 23, 2017

I think I'm experiencing a similar error to this. Here are my logs:

{"log":"2017/03/23 10:20:16 [notice] 56#56: *1071875 [lua] ssl_certificate.lua:85: issue_cert(): auto-ssl: issuing new certificate for example.com, context: ssl_certificate_by_lua*, client: 172.31.2.109, server: 0.0.0.0:443\n","stream":"stderr","time":"2017-03-23T10:20:16.473051893Z"}
{"log":"2017/03/23 10:20:16 [error] 56#56: *1071875 lua entry thread aborted: runtime error: ...re/lua/5.1/resty/auto-ssl/ssl_providers/lets_encrypt.lua:21: attempt to concatenate a nil value\n","stream":"stderr","time":"2017-03-23T10:20:16.473057068Z"}
{"log":"stack traceback:\n","stream":"stderr","time":"2017-03-23T10:20:16.473060427Z"}
{"log":"coroutine 0:\n","stream":"stderr","time":"2017-03-23T10:20:16.473064263Z"}
{"log":"\u0009...re/lua/5.1/resty/auto-ssl/ssl_providers/lets_encrypt.lua: in function 'issue_cert'\n","stream":"stderr","time":"2017-03-23T10:20:16.473067343Z"}
{"log":"\u0009.../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:86: in function 'issue_cert'\n","stream":"stderr","time":"2017-03-23T10:20:16.473070704Z"}
{"log":"\u0009.../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:112: in function 'get_cert'\n","stream":"stderr","time":"2017-03-23T10:20:16.47307377Z"}
{"log":"\u0009.../luajit/share/lua/5.1/resty/auto-ssl/ssl_certificate.lua:245: in function 'ssl_certificate'\n","stream":"stderr","time":"2017-03-23T10:20:16.47307742Z"}
{"log":"\u0009.../local/openresty/luajit/share/lua/5.1/resty/auto-ssl.lua:70: in function 'ssl_certificate'\n","stream":"stderr","time":"2017-03-23T10:20:16.473080429Z"}
{"log":"\u0009ssl_certificate_by_lua:2: in function \u003cssl_certificate_by_lua:1\u003e, context: ssl_certificate_by_lua*, client: 172.31.2.109, server: 0.0.0.0:443\n","stream":"stderr","time":"2017-03-23T10:20:16.473083802Z"}
{"log":"2017/03/23 10:20:16 [error] 56#56: *1071875 lua entry thread aborted: runtime error: ...re/lua/5.1/resty/auto-ssl/ssl_providers/lets_encrypt.lua:21: attempt to concatenate a nil value\n","stream":"stderr","time":"2017-03-23T10:20:16.47308708Z"}
{"log":"2017/03/23 10:20:16 [crit] 56#56: *1071874 SSL_do_handshake() failed (SSL: error:1408A179:SSL routines:ssl3_get_client_hello:cert cb error) while SSL handshaking, client: 172.31.2.109, server: 0.0.0.0:443\n","stream":"stderr","time":"2017-03-23T10:20:16.473152026Z"}

About 2 minutes after this happens, the docker container becomes totally unresponsive and only a restart fixes it. I'm on v0.10.3-1 of this plugin.

@domharrington

domharrington commented Apr 12, 2017

Hey,

I've upgraded to the latest version and now I get the following error in my logs:

*23036 [lua] ssl_certificate.lua:265: ssl_certificate(): auto-ssl: failed to run do_ssl: ...re/lua/5.1/resty/auto-ssl/ssl_providers/lets_encrypt.lua:15: attempt to concatenate a nil value, context: ssl_certificate_by_lua*

This seems to happen around the time when new certificates are generated.

Edit: this appears to be coming from this file: https://github.com/GUI/lua-resty-auto-ssl/blob/f7fe3e62c081d77c7d021fcc10371ebb26c97bc0/lib/resty/auto-ssl/ssl_providers/lets_encrypt.lua#L15. Maybe hook_port is nil somehow? I can't see how that would be the case, as the asserts above should cover this: https://github.com/GUI/lua-resty-auto-ssl/blob/f7fe3e62c081d77c7d021fcc10371ebb26c97bc0/lib/resty/auto-ssl/ssl_providers/lets_encrypt.lua#L10-L11

Edit 2: I'm also not overriding hook_server_port, so this should always be set to the default: https://github.com/GUI/lua-resty-auto-ssl/blob/e994c8cc5d00100401dbd0138180bd368b80d34c/lib/resty/auto-ssl.lua#L43-L45

GUI added a commit that referenced this issue Apr 17, 2017
If the hook secret was nil, then it was leading to errors, but the
backtraces were a bit odd, since they weren't reporting the right line
number: #48

Still need to figure out a better way to ensure hook_secret doesn't
disappear if the shared dict gets full, but this at least improves the
errors to make things easier to understand.
GUI added a commit that referenced this issue Apr 17, 2017
If the shared dict ran out of memory, then the hook secret could end up
getting removed, leading to errors like this:
#48 (comment)

By storing the hook secret as a global variable, it should no longer be
subject to getting removed from the shared dict.
GUI added a commit that referenced this issue Apr 17, 2017
This should more generally help with potential issues seen in
#48 where old items
might get removed from the shdict if there are too many certificates for
the configured amount of memory. By logging this as an error, it will
hopefully make debugging and alerting to the real issue easier (need to
increase shdict memory).

Since we don't store duplicate certs for domains, and the certificate
data is stored with a TTL, I think this should be okay to log at the
error level: if this happens, things aren't going well, because there's
not enough memory for the active domains.
@GUI
Collaborator

GUI commented Apr 17, 2017

@domharrington: Thanks for reporting this and for the details! I think I was able to track down what was happening. The line number on the error backtrace was a bit off for some reason, but the culprit was hook_secret being nil. The only way I think this could happen is if the shdict used for caching certs data in memory ran out of space. I think this in turn could lead to things becoming unresponsive if new certificates were being registered (because without the secret, the internal communication becomes rather broken).
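To illustrate the theory (this is not auto-ssl's actual code, and the key name is made up): a value written to a lua_shared_dict at startup can later be silently evicted when the dict fills up, so a subsequent get() returns nil.

local dict = ngx.shared.auto_ssl

-- Stored once when the hook server is set up.
dict:set("hook_server:secret", "some-random-secret")

-- Much later, after many certificates have been cached and the dict has
-- filled up, the secret may have been evicted to make room:
local secret = dict:get("hook_server:secret")
if not secret then
  -- Using this nil value (for example in a string concatenation while
  -- building the hook request) surfaces as "attempt to concatenate a nil
  -- value" in lets_encrypt.lua.
  ngx.log(ngx.ERR, "auto-ssl: hook secret missing from the shared dict")
end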

So if this theory is correct, you might need to increase the configured size of lua_shared_dict auto_ssl in your nginx config. I estimated that for each 1MB you can store roughly 100 certificates, but that estimate might be off. Does this seem like it could be the issue in your setup, or do you not have enough certificates for that to make sense?
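For reference, the relevant directive looks something like the following (the 8m value is purely illustrative, derived from the rough 100-certificates-per-1MB estimate above, not a project recommendation):

http {
  # lua-resty-auto-ssl caches certificate data in this shared dict. With a
  # few hundred certificates, a 1m dict is likely too small.
  lua_shared_dict auto_ssl 8m;

  # ... rest of the usual auto-ssl http configuration ...
}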

In any case, I've released v0.10.6, which I'm hoping should fix this in a couple different ways:

  • The hook secret is stored in a different way, so it shouldn't ever be subject to getting evicted from the cache, like it was before.
  • If the lua_shared_dict used for caching data begins to run out of memory, more explicit errors are logged to nginx's error log, suggesting that you raise the configured amount (but thanks to the first change the server should still be operational when this happens, just less efficient).

I'm hoping those updates in v0.10.6 should make things less fatal and easier to track down if your configured amount of memory is too low. This might also explain some other unresponsive issues a couple other people reported, but we were never able to get to the bottom of, so thanks again for the report.

Caveat: I'm not entirely certain this error is related to the original issue @gjongenelen reported.

@GUI GUI mentioned this issue Apr 17, 2017
@domharrington

Hey @GUI thank you very much for investigating this and coming up with a potential solution.

We currently have lua_shared_dict set to 1m, and we have a few hundred certs. I will try bumping this limit now and hopefully that will solve the problem!

Thank you very much again for taking the time to look into this. I will upgrade to 0.10.6 now.

Dom

@luto
Collaborator

luto commented May 31, 2017

@domharrington did you experience any more issues after bumping the limit and upgrading to the latest version?

@domharrington

@luto nope. It's been very stable since deploying the latest update.

@GUI
Collaborator

GUI commented Jun 1, 2017

Great, thank you for confirming! Going to go ahead and close this then.

@GUI GUI closed this as completed Jun 1, 2017
GUI added a commit that referenced this issue Jun 18, 2017
If the `auto_ssl` shared dict ran out of memory and nginx was then
reloaded, a race condition existed where multiple "sockproc" processes
could try to start at the same time.

While I think this situation would be unlikely to affect a production
system in a negative way (since the race condition only occurs during
nginx reloads and one of the sockproc processes should still succeed
and allow things to work), it did lead to some errors being logged,
which would intermittently cause our tests to fail (it would crop up on
the test following our t/memory.t test, since that next test would be
the first reload following our test to explicitly exhaust the memory).

This is fixed by using the `auto_ssl_settings` shared dict for storing
the resty-lock details (the lock prevents multiple processes from being
started at once). This smaller shared dict (introduced in #68) is used
for storing bits of data that won't grow in size so we can better ensure
the data will never be evicted from the cache. I'm now able to
repeatedly run the test suite in a loop without hitting this edge case.

Note that we are still using the `auto_ssl` shared dict for storing
resty-lock details for domain registrations, since the memory
requirements for that may grow (since there's a lock per domain, it's
dynamic in size). But that should be okay, because similar to the SSL
certs stored in `auto_ssl`, we're okay with cache evictions for old data
in those cases (along with warnings being logged).

Also possibly relevant is that currently resty-lock always uses `add`
for the shared dict (so it will evict old data to make room if
necessary), but there's a pull request for allowing use of `safe_add`:
openresty/lua-resty-lock#6. While we should be
okay by switching things to `auto_ssl_settings` (since we should never
have enough stored data in there to need evicting old data), it's
something to keep an eye on.

Related to:
- #68 (comment)
- #48 (comment)
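For context, here is a minimal sketch of the locking pattern this commit describes (not the project's exact code; start_sockproc is a hypothetical stand-in for however the daemon actually gets launched): the resty-lock is backed by the small `auto_ssl_settings` dict, whose entries shouldn't be at risk of eviction, so only one worker starts sockproc during a reload.

local resty_lock = require "resty.lock"

local function start_sockproc_once()
  -- Back the lock with the small, fixed-size dict so the lock key itself
  -- can't be evicted the way data in the larger auto_ssl dict can be.
  local lock, new_err = resty_lock:new("auto_ssl_settings", { exptime = 30 })
  if not lock then
    return nil, "failed to create lock: " .. (new_err or "")
  end

  local _, lock_err = lock:lock("sockproc_starting")
  if lock_err then
    return nil, "failed to obtain lock: " .. lock_err
  end

  -- start_sockproc() is a hypothetical stand-in; only the worker holding
  -- the lock runs it.
  local ok, start_err = start_sockproc()

  lock:unlock()
  return ok, start_err
end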