-
Notifications
You must be signed in to change notification settings - Fork 23.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SSL implementation on all channels #4855
Conversation
Hello and thanks for this PR. The code looks very good. I'm going to release RC1 of 5.0, this means that this is the right moment to merge something like that into unstable. I'm leaving for SF tomorrow, when I'll be back I'll start a review and provide feedbacks. First feedback btw:
This function looks odd because it both handlers are set it returns just the first, so it has a lack of generality AFAIK. This is a minor concern of course, but just popped out scanning the code. |
Finally getting a chance to look at this, as I was working on a similar set of changes set to release in a few weeks. Questions:
Comments:
|
Hey Josiah, Really appreciate the review! Answering your questions:
Stability is the main reason, we (AWS) will support bumping the version to 1.1.x when it becomes more stable.
I did look over it at one point, the changes are relatively minor. There is some x509 parsing done in ssl.c that needs updating. There will also be some changes to the download and Makefile. If this get's documented I was planning to fully document it and open an issue. |
Hello all, an updated on the merge schedule. The patch will be reviewed between 2nd and 2rd of May and I hope I'll merge everything straight away (very likely I'll merge even if the implementation will evolve, since anyway is opt-in code and is already a big jump forward). More news ASAP! |
Manually downloading 1.1.0h and using it instead seems to work without any other changes, though I keep getting 1 test failure with Also got a chance to benchmark this; and it's about 5-10% faster in both single and pipelined requests for SSL than my code in the few examples I've tested. e.g. NOSSL/MYSSL/YOURSSL - 120k / 70k / 75k - 480k / 315k / 340k. So your SSL is still a huge win in basically every scenario I posted about last Thursday, nice. :) And your connection latency is better too, I really wish I knew what happened inside s2n to make that happen; I've been fighting initial connection latency for weeks. :P Digging through more, I had added a couple extra buffers to handle SSL vs. plain, and your macros eliminated your need to do the same. So you're doing fewer copies, had to touch fewer integration points, etc. I really like this PR. |
Hi Salvatore, As discussed at RedisConf18 with you, we would love to get this feature soon in core Redis. On your message about updated merge schedule above, did you mean "patch will be reviewed between 2nd and 3rd of May? |
@vkasar Yes sorry I delayed in order to consolidate RESP3 / client side caching, I hope to start the review tomorrow morning, we planned this activity with @artix75 for tomorrow afternoon and will do a first four-eyes review of the entire patch, writing feedbacks. However while we are here, I need urgently a contact email for the Amazon Redis team for another code section of Redis we can collaborate about, please could you give me one at antirez / gmail? Thanks. |
Sent my and Kevin's email your gmail. Feel free to send it to either of us and we can loop in other engineers as necessary. |
@antirez Thanks for your plan to review the patch soon. I do not work for Amazon but you should have the requested information thanks to Madelyn :-) |
Hello, first feedbacks :-) We are starting to approach it as users, we noticed a few things that could be improved:
All the above can be easily fixed, but just to provide incremental feedbacks, this is the first things that we found with @artix75 while starting the review. We are going to check the code. |
Quick question, we understand that now the nodes have an hostname endpoint attribute which is not an IP like in the non-SSL Redis Cluster. Is this something specific of SSL that in order to work must be associated with a DNS name that can be resolved, or is it still technically possible to use bare IP addresses? Thanks. |
[We are sorry for the many questions and comments, but as we found new things we comment] There is a new function clusterClientSetup(), this is refactored from the old code inside clusterCron(). The function uses a different logic depending on the fact SSL is enabled or not: notably the old ping time is not restored if the connection is SSL. This however has the problem that a continuously disconnected node can never be sensed as failing. From the comment what I understand is that the reason for this is that SSL is slow to connect, so if we restore the old ping time, the PING timeouts and we falsely detect the connection as failing. However it looks like that the problem here is that Redis-SSL should use larger timeout times, not that we should change the logic. The problem is that after such a change:
Does this makes sense? Our proposal is to restore the original code to have the same behavior with SSL, and instead warning the user at Redis Cluster startup if they setup a timeout period which is too short compared to SSL reconnection times. |
Errata corrige: C99 actually has false/true/bool apparently including the proper header, however we don't use it in the Redis source code. Probably GCC automatically has this include from some other header, while clang does not have it. Btw I already fixed those few places in my local branch so no problem. |
I've a question about
So I guess that there is the possibility that the socket is double freed or alike, and we want to avoid that it creates problems? After all calling close(fd) multiple times in the same file descriptor can be also very dangerous but is not an immediate crash. However in such case, isn't it better to just make |
We are trying to find our way among the PR right now :-) However what we clearly understood so far is that it is not possible for us to merge this PR before 5.0 goes life, and this must be merged later into unstable. There are several reasons for this that I would like to summarize here, given the popularity of this PR:
So we are going to continue to review and make comments here in the days following the RC1 release (we believe in 1 or 2 weeks), in order to get near and near to the merge. However we also wonder if it is possible to have a conversation about the alternatives. For instance, the same SSL implementation that works as a thread that proxies Redis connections via an UNIX socket would allow perfect separation of the code base. Is such approach so detrimental to be impossible to use, or is actually viable? And so forth. Or, as a middle ground, maybe it is possible to limit the places where the code is ifdeffed, and instead leave the fields and have a global ssl_enabled parameter, and higher level functions that do different things based on the fact SSL is used or not, and so forth. Probably the proposed approach in this PR is the right one, but we want to understand the alternatives a little better: if SSL is going to happen in Redis in the next weeks, we would like it to happen in the best possible way. Certain effects of the PR in the source code are a bit discouraging. Thank you again for all this work, we'll write more comments in this PR soon. |
One feature of SSL is being able to perform hostname verifications. The FQDN of the server is in the certificate that it presents during the SSL handshake. As a client, I want to verify that the certificate being presented is for the server I'm trying to connect with. Also, moving to using FQDNs - especially in the cluster nodes.conf file is useful in a container environment. This because if a Redis container restarts it could very well be assigned a new IP address. My understanding is that will cause issues. Assuming static IP addresses is most any application is not a good thing. |
Hiya, thanks for the feedback, and sorry for taking so long to get back around to this. Stuff that was fixed in the latest commit
Stuff that isn't quite fixed yet
Stuff that was left the same:
I think that covers the minor stuff, to the bigger questions.
That solution will work, since it is effectively doing the same work that SSL proxies are doing today. It adds additional buffers and syscalls which will impact latency. It could be better for throughput if the process is bottle necked on CPU and if there are free cores available on the machine.
I think this is the good compromise. When submitting this pull request, we had a bit of discussion on how much/where to ifdef. There is probably only a couple of places ifdefs are strictly required if we made the cluster bus change: wrapping ssl.c/h since it contains s2n calls, the server macros because they reference ssl.c/h and preventing enable-ssl if it's not built. Everything else could check ssl at runtime. We were hoping to limit the impact in this request, and then peel off the ifdefs as we gain confidence in the implementation. We are running similar code same code today in ElastiCache, but there are more use cases then we currently allow and want to make sure it's robust. I think @ccs018 answered the FQDN question well. Let me know if I missed something, and thanks again for considering this request :D |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revision 2
deps/Makefile
Outdated
cd openssl && $(SSL_CONFIG) | ||
cd openssl && $(MAKE) depend | ||
cd openssl && $(MAKE) | ||
cd openssl && $(MAKE) install |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can overwrite / blow away system libraries and docs. May want to remove this line.
SSL/TLS support is added to make Redis input working with services like AWS Elasticache with Encryption-in-transit turned on, Redis with stunnel or Redis with native SSL support (redis/redis#4855).
Super excited about this potential functionality in upcoming Redis versions, may I ask in which state this PR right now? |
Thank you for your work @madolson. FWIW, this is exactly my experience with using Redis via Stunnel and Hitch; more interrupts, more memcopies, worse performance. And is why I spent about a week benchmarking my version before being late to the party. There is a way of walking around the problem by using a bit of shared memory directly instead of unix domain sockets, but now you've got multi-process queues via shared memory, still have to memcopy into the ssl context, and you've just re-implemented user-mode networking / dynamic linking / and/or unix domain sockets. What about spawning an in-process thread? Of the 4x Redis + SSL versions I have personally written, direct-inline (or via define) like @madolson's original patch is consistently the most performant. Even internal to a single process, both threaded Redis + SSL implementations I've written have been slower than the non-threaded versions, and @madolson's was faster than all four of mine. I could have sworn I recently saw reference to a paper on threading for offloading networking inside of Redis, and how they managed to get improvements, but I'm not finding it at the moment (and I feel like I'd have read the paper just to see what they did that I can learn from). Having been using a downstream version of @madolson's PR since May last year, about the only issue I've had is that if enough sockets are in TIME_WAIT, your Redis can panic with an errno 0 (non-error) in a return from OpenSSL 1.1.0 (I moved forward) to S2N after 5-6 hours. Inserting a 'sleep 30' between 90-120 second test runs seems to have worked around the issue for me. YMMV. |
Thanks for that info Josiah! I also want to add that I think one of the main concerns is the intrusiveness of the code and I think there are some ways we can modify to make it less intrusive.
Thanks! |
Hello @madolson, I was not yet able to read the PR but I read your comments, and my feeling is that is great that this approach lead to a great simplification. About the CPU usage, it is non trivial because there are four syscalls involved in the SSL thread for every query, so I expect that to be almost completely CPU spent inside the kernel. However there is to consider that when Redis is busy doing more expensive queries, the IOPs will be higher with this approach since the SSL part is no longer handled by the main thread, but yep, in the base case overall the total CPU usage is rising considerably. I'm not concerned about the change in operations per second because that is within a tolerable margin. I've the feeling that the original PR cannot be improved in a significant way because anyway it is based on the idea of hooking, so I believe we should try to go forward with one of the following approaches:
About "2", if we think solely at the clients SSL for now (it is possible that we could solve replication / cluster with a stream cipher and a shared secret, because anyway there is a trust relationship between the nodes), do you think such approach would work? Note that this hook would be specific for the client code inside networking.c when the client reads or writes to the socket. That would allow a similar separation of concerns, without resorting to the side thread, so the CPU usage should be the same as your original PR. Thanks. |
Thanks for the response @antirez! I actually think the ops/sec for the threaded approach is fine, my concern was with how much higher the CPU usage was. If someone is running this in the cloud, they will basically need to pay twice as much if their system is bottlenecked on CPU. I think that is a very big price to pay for SSL, and I would like to be sure it is really a lot simpler before committing to it. I do think it simplifies quite a few edge cases though. For #2, I'm imagining something like defining:
I think this is sufficient for isolating off the ssl components for a single server. I don't think your solution can cover replication though. If we are only covering client connections, the slave will reach out to the master and be expected to start an SSL handshake before the master knows that the slave wants to do replication. We would need to alter the flow of replication by exposing some non-ssl port for it to do symmetric key encryption. This will also fail for the migrate command. I think to handle it in a moduleish way, all of the non-cluster communication needs to be supported with the hooks. Cluster could then be its own subsystem with symmetric encryption. I also want to understand the motivation of why you think the module approach is better than the current implementation. Both of them would rely on hooks to handle SSL, then resume the flow of the code once SSL has been handled. I understand today that is pretty intrusive in our implementation, but we could make it less intrusive with some of the suggestions I mentioned earlier. This should make the code about as easy to read as the module proposal, and it can indirectly rely on a library that supports SSL. I also still think #4 that I mentioned from above has merit, having the proxy for replication + cluster and make the necessary changes in networking.c so clients were handled directly (Either through modules or simplification of what is currently implemented). This avoids the bulk of the performance issues seen in the clients and simplifies the complicated touch points. Would you be able to make some time to talk about this real time? I can stay up late some day, it might help reduce some of the back and forth. Thanks, |
@davissp14 There is some interesting work doing TLS in the kernel, https://github.com/torvalds/linux/blob/master/Documentation/networking/tls.txt, which may be able to improve our implementation if we are okay with a hard dependency on linux for TLS. That solves a lot of the weird touchpoints. |
@madolson Have you considered LD_PRELOAD and overriding write() etc? Your approach is much preferred vs proxying so we are considering merging your changes into our multithreaded fork. However transitioning from write() to hwrite() and other intrusiveness has me concerned. With an overriding implementation you can use the fd to lookup your context struct and handle that transparently in the background. |
@JohnSully Hey, sorry I didn't respond right away. Glad you are considering including it! We did meet with Salvatore at RedisConf, and we agreed to an effort to either A) Build all hooks for a module or B) Accept this PR before Redis 6.0 is released. I'll be posting another draft this week to both bring it up to the tip of unstable and try to make the code cleaner. We didn't look at preload, but we looked into doing gcc wrap, which accomplishes a similar thing. There is a minor caveat where you can't call SSL read/write on non-networking fds. We were trying to be cautious in our approach so we crash on network calls where we haven't set up SSL. We can address this though and I likely will do so in the next revision. |
This is more or less up to date again. Redis-cli now no longer supports SSL, so if you want to check that our I encourage you to also pull https://github.com/madolson/redis/tree/dev-unstable-ssl-original which is now the original pull request. The major changes from last time are:
|
@@ -1256,7 +1265,7 @@ void readSyncBulkPayload(aeEventLoop *el, int fd, void *privdata, int mask) { | |||
if (eof_reached) { | |||
int aof_is_enabled = server.aof_state != AOF_OFF; | |||
|
|||
/* Ensure background save doesn't overwrite synced data */ | |||
/* Ensure background save doesn't ovewrite synced data */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
I had a chance to experiment with this change and the potential to have it as a module. A module feels like a better option because its a lot less intrusive than this change. It requires 5 new module APIs that don't yet exist. To work around that I'm hotpatching the binary with the module. Module is here: https://github.com/JohnSully/modssl/ All we need to do is allow modules to add/remove AE handlers and hook the following APIs:
|
I've made a demo of the APIs needed to make this a module here: #6140 The modssl project is also updated to use them here: https://github.com/JohnSully/modssl/tree/NewApi |
Finally I got to review this and have a few comments:
For example, when reading from a TLS connection we may receive a renegotiation (TLS 1.2) or key update (TLS 1.3) request which must be replied to before any further application data can be read. OpenSSL implements this with I wrote a small test client that validates this. Note that I did later realize s2n does not support TLS 1.2 renegotiations and completely doesn't support TLS 1.3 (and it's revamped key update), but you would still expect the server to at least respond with a TLS alert and not leave the client dangling.
|
Really appreciate the review @yossigo! Commenting here, are you going to attend the monthly sync, if so we could also follow up about some of these items there.
|
@madolson I'll reply here to keep this discussion public.
|
Closing, since there was a different PR. |
Is there a link to the other PR? |
I guess this one? #6236 |
See SSL_README.md for more information:
make test
BUILD_SSL=yes make test
BUILD_SSL=yes make test-ssl
pass 48/48 tests for all 3 combinations
\o/ All tests passed without errors!
The cluster tests also generally pass, redis-trib doesn't fully support ssl though which causes some failures.