SSL implementation on all channels #4855

Open
wants to merge 2 commits into
base: unstable
from

6 participants
@madolson
Contributor

madolson commented Apr 19, 2018

See SSL_README.md for more information:

make test
BUILD_SSL=yes make test
BUILD_SSL=yes make test-ssl

pass 48/48 tests for all 3 combinations
\o/ All tests passed without errors!

The cluster tests also generally pass, redis-trib doesn't fully support ssl though which causes some failures.

SSL implementation on all channels
See SSL_README.md for more information

@madolson madolson referenced this pull request Apr 19, 2018

Open

Create RCP12 #20

@ccs018

ccs018 commented Apr 19, 2018

@madolson madolson referenced this pull request Apr 20, 2018

Open

Redis SSL support #2178

@antirez

Owner

antirez commented Apr 20, 2018

Hello and thanks for this PR. The code looks very good. I'm going to release RC1 of 5.0, which means that this is the right moment to merge something like this into unstable. I'm leaving for SF tomorrow; when I'm back I'll start a review and provide feedback. First feedback btw:

aeGetFileProc()

This function looks odd because if both handlers are set it returns just the first, so it lacks generality AFAIK. This is a minor concern of course, but it popped out while scanning the code.
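
A minimal sketch of one way the concern above could be addressed: the caller asks for exactly one direction, so the result is unambiguous even when both handlers are installed. The types and the name `getFileProcFor` are hypothetical stand-ins, not the real ae.h definitions or the PR's code.

```c
#include <assert.h>
#include <stddef.h>

#define AE_READABLE 1
#define AE_WRITABLE 2

/* Simplified stand-ins for the event loop types (assumption, not ae.h). */
typedef void fileProc(int fd, void *clientData, int mask);

typedef struct fileEvent {
    int mask;             /* AE_READABLE | AE_WRITABLE bits currently set */
    fileProc *rfileProc;  /* read handler, if any */
    fileProc *wfileProc;  /* write handler, if any */
} fileEvent;

/* Hypothetical lookup: 'which' must be exactly one of AE_READABLE or
 * AE_WRITABLE. Returns NULL when no handler is registered for it. */
fileProc *getFileProcFor(const fileEvent *fe, int which) {
    assert(which == AE_READABLE || which == AE_WRITABLE);
    if (!(fe->mask & which)) return NULL;
    return (which == AE_READABLE) ? fe->rfileProc : fe->wfileProc;
}

/* Dummy handlers, used only to tell the two pointers apart. */
void onReadable(int fd, void *clientData, int mask) {
    (void)fd; (void)clientData; (void)mask;
}
void onWritable(int fd, void *clientData, int mask) {
    (void)fd; (void)clientData; (void)mask;
}
```

With both handlers registered, asking for AE_READABLE yields the read handler and asking for AE_WRITABLE yields the write handler, so neither is hidden behind the other.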

@josiahcarlson

josiahcarlson commented Apr 30, 2018

Finally getting a chance to look at this, as I was working on a similar set of changes set to release in a few weeks.

Questions:

  • Are you using 1.0.2 instead of 1.1.0 for the LTS support until December 2019? API stability? Something else?
  • Have you looked into the changes necessary to work with 1.1.0? (I'll probably answer this one myself this afternoon)

Comments:

  • Great work with the SSL renegotiation for the diskless replication stuff; I'd planned on passing data to the parent to avoid a renegotiation (wasn't sure how to make it happen with plain OpenSSL). Your solution is way better :)
  • I like your use of rread()/rwrite()/hread()/hwrite() macros, and the interface inside s2n that lets you find out if there's any more data inside / partial packets. I'd been doing an extra read/write in those cases on plain OpenSSL, and it was annoying to get right to prevent IO hangs. This is better code.
@madolson

Contributor

madolson commented Apr 30, 2018

Hey Josiah,

Really appreciate the review! Answering your questions:

Are you using 1.0.2 instead of 1.1.0 for the LTS support until December 2019? API stability? Something else?

Stability is the main reason, we (AWS) will support bumping the version to 1.1.x when it becomes more stable.

Have you looked into the changes necessary to work with 1.1.0? (I'll probably answer this one myself this afternoon)

I did look over it at one point, the changes are relatively minor. There is some x509 parsing done in ssl.c that needs updating. There will also be some changes to the download and Makefile. If this gets documented I was planning to fully document it and open an issue.

@antirez

Owner

antirez commented Apr 30, 2018

Hello all, an update on the merge schedule. The patch will be reviewed between 2nd and 2rd of May and I hope I'll merge everything straight away (very likely I'll merge even if the implementation will evolve, since it is opt-in code anyway and is already a big jump forward). More news ASAP!

@josiahcarlson

josiahcarlson commented Apr 30, 2018

Manually downloading 1.1.0h and using it instead seems to work without any other changes, though I keep getting 1 test failure with unit/wait.tcl regardless of whether I compile with/without openssl, which version of openssl I use, or whether I do make test or make test-ssl. Do you get that failure too?

Also got a chance to benchmark this; and it's about 5-10% faster in both single and pipelined requests for SSL than my code in the few examples I've tested. e.g. NOSSL/MYSSL/YOURSSL - 120k / 70k / 75k - 480k / 315k / 340k. So your SSL is still a huge win in basically every scenario I posted about last Thursday, nice. :) And your connection latency is better too, I really wish I knew what happened inside s2n to make that happen; I've been fighting initial connection latency for weeks. :P

Digging through more, I had added a couple extra buffers to handle SSL vs. plain, and your macros eliminated your need to do the same. So you're doing fewer copies, had to touch fewer integration points, etc. I really like this PR.

@ghost ghost referenced this pull request Apr 30, 2018

Open

client support for tls #176

@benbro benbro referenced this pull request May 4, 2018

Open

SSL/TLS support #176

@vkasar

vkasar commented May 8, 2018

Hi Salvatore,

As discussed with you at RedisConf18, we would love to get this feature into core Redis soon. Regarding your message above about the updated merge schedule, did you mean "the patch will be reviewed between 2nd and 3rd of May"?

@antirez

Owner

antirez commented May 9, 2018

@vkasar Yes, sorry, I delayed in order to consolidate RESP3 / client side caching. I hope to start the review tomorrow morning; we planned this activity with @artix75 for tomorrow afternoon and will do a first four-eyes review of the entire patch, writing up feedback.

However, while we are here, I urgently need a contact email for the Amazon Redis team about another section of the Redis code we can collaborate on. Could you please send me one at antirez / gmail? Thanks.

@madolson

Contributor

madolson commented May 9, 2018

Sent my and Kevin's email addresses to your gmail. Feel free to write to either of us and we can loop in other engineers as necessary.

@vkasar

vkasar commented May 9, 2018

@antirez Thanks for your plan to review the patch soon.

I do not work for Amazon but you should have the requested information thanks to Madelyn :-)

@antirez

Owner

antirez commented May 9, 2018

Thanks @madolson, @vkasar :-)

@antirez

Owner

antirez commented May 10, 2018

Hello, first feedbacks :-)

We are starting to approach it as users, we noticed a few things that could be improved:

  1. Sometimes C++ constructs are used, such as bool, true, false. Redis code base is C99 so this is not allowed, and indeed clang will not compile it.
  2. After following the instructions it is not possible to build the patch on MacOS.
  3. The code often does not follow the Redis conventions for writing C code.

All the above can be easily fixed, but just to provide incremental feedback, these are the first things that we found with @artix75 while starting the review. We are going to check the code.

@antirez

Owner

antirez commented May 10, 2018

Quick question, we understand that the nodes now have a hostname endpoint attribute which is not an IP like in the non-SSL Redis Cluster. Is this something specific to SSL, which in order to work must be associated with a DNS name that can be resolved, or is it still technically possible to use bare IP addresses? Thanks.

@antirez

Owner

antirez commented May 10, 2018

[We are sorry for the many questions and comments, but as we found new things we comment]

There is a new function clusterClientSetup(), which is refactored from the old code inside clusterCron(). The function uses different logic depending on whether SSL is enabled: notably the old ping time is not restored if the connection is SSL. This however has the problem that a continuously disconnected node can never be sensed as failing.

From the comment, what I understand is that the reason for this is that SSL is slow to connect, so if we restore the old ping time, the PING times out and we falsely detect the connection as failing. However it looks like the problem here is that Redis-SSL should use larger timeouts, not that we should change the logic. The problem is that after such a change:

  1. Intermittent connections will be sensed as healthy AFAIK.
  2. What is worse is that the connection management of Redis Cluster starts a reconnection on purpose, so it is very likely that we can trigger a situation where failures are not detected.

Does this make sense? Our proposal is to restore the original code to have the same behavior with SSL, and instead warn users at Redis Cluster startup if they set up a timeout period which is too short compared to SSL reconnection times.
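
The startup warning proposed above could look roughly like the following sketch. The function name, the safety factor, and above all the 500 ms reconnect estimate are assumptions made for illustration; a real figure would have to be measured.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed worst-case duration of one SSL reconnect, in milliseconds.
 * This number is a placeholder, not a measurement. */
#define SSL_RECONNECT_ESTIMATE_MS 500

/* Hypothetical startup check: returns 1 and logs a warning when the
 * configured node timeout leaves room for fewer than 'safety_factor'
 * SSL reconnects, and 0 otherwise (or when SSL is disabled). */
int warnIfTimeoutTooShortForSsl(long long node_timeout_ms, int ssl_enabled,
                                int safety_factor) {
    assert(safety_factor > 0);
    if (!ssl_enabled) return 0;
    if (node_timeout_ms >= (long long)SSL_RECONNECT_ESTIMATE_MS * safety_factor)
        return 0;
    fprintf(stderr,
            "WARNING: cluster-node-timeout %lld ms may be too short for "
            "SSL reconnections (estimated %d ms each)\n",
            node_timeout_ms, SSL_RECONNECT_ESTIMATE_MS);
    return 1;
}
```

The point of the design is that failure detection keeps its original logic; only the operator is told that the chosen timeout may produce false positives under SSL.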

@antirez

Owner

antirez commented May 10, 2018

Errata corrige: C99 actually does have false/true/bool, apparently, when the proper header is included; however we don't use it in the Redis source code. Probably GCC automatically gets this include from some other header, while clang does not. Btw I already fixed those few places in my local branch so no problem.
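
For reference, the header in question is `<stdbool.h>`. The sketch below shows the same hypothetical helper written both ways: once with C99 `bool` and an explicit include, and once in the plain-int style that a code base avoiding `bool` would use. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>  /* C99: provides bool, true and false */
#include <string.h>

/* Hypothetical helper: interpret a "yes"/"no" config value as a C99 bool.
 * Including <stdbool.h> explicitly means the code does not rely on some
 * other header happening to pull it in. */
bool parseYesNo(const char *value) {
    assert(value != NULL);
    return strcmp(value, "yes") == 0;
}

/* The same helper in the plain-int style, with no dependency on bool. */
int parseYesNoInt(const char *value) {
    assert(value != NULL);
    return strcmp(value, "yes") == 0 ? 1 : 0;
}
```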

@antirez

Owner

antirez commented May 10, 2018

I have a question about getSslConnectionForFd(). Basically the function asserts that the fd-to-SSL-handle table is not NULL. This in theory means that there is a hard rule that says that if SSL is enabled, every FD has a corresponding SSL handle. However in other parts of the code I then find checks for NULL:

 void freeClusterLink(clusterLink *link) {
     if (link->fd != -1) {
         aeDeleteFileEvent(server.el, link->fd, AE_READABLE|AE_WRITABLE);
+#ifdef BUILD_SSL
+        if(server.ssl_config.enable_ssl == true){
+            serverAssert((unsigned int)link->fd < server.ssl_config.fd_to_sslconn_size);
+            if(server.ssl_config.fd_to_sslconn[link->fd] != NULL){
+                cleanupSslConnectionForFd(link->fd);
+            }
+        }
+#endif
     }

So I guess that there is the possibility that the socket is double freed or the like, and we want to avoid that creating problems? After all, calling close(fd) multiple times on the same file descriptor can also be very dangerous but is not an immediate crash.

However in such a case, isn't it better to just make cleanupSslConnectionForFd() check for NULL and do nothing in that case? And maybe just assert to check that we are inside the range of the conversion table? Thanks.
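
A minimal sketch of the suggestion above, assuming a simplified per-fd table: the range check stays an assertion, while an empty slot is silently ignored so the cleanup is safe to call twice. The struct layout and the `Safe` function name are hypothetical and not taken from the PR.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the PR's per-fd SSL bookkeeping. */
typedef struct sslConn { int fd; } sslConn;

typedef struct sslTable {
    sslConn **fd_to_sslconn;    /* indexed by file descriptor */
    size_t fd_to_sslconn_size;  /* number of slots in the table */
} sslTable;

/* Idempotent cleanup: an out-of-range fd is a programming error and
 * trips the assertion; an already-empty slot is simply skipped.
 * Returns 1 if a connection was freed, 0 if there was nothing to do. */
int cleanupSslConnectionForFdSafe(sslTable *t, int fd) {
    assert(fd >= 0 && (size_t)fd < t->fd_to_sslconn_size);
    if (t->fd_to_sslconn[fd] == NULL) return 0;
    free(t->fd_to_sslconn[fd]);
    t->fd_to_sslconn[fd] = NULL;
    return 1;
}
```

With this shape, callers such as freeClusterLink() would not need their own NULL check before calling the cleanup function.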

@antirez

Owner

antirez commented May 10, 2018

We are trying to find our way around the PR right now :-) However what we clearly understood so far is that it is not possible for us to merge this PR before 5.0 goes live, and this must be merged later into unstable. There are several reasons for this that I would like to summarize here, given the popularity of this PR:

  1. While the PR looks sound in general, it is still beta code, and we saw that Redis internals were very SSL hostile in certain ways. Both fork() and the Cluster architecture forced many compromises. The readability of the code was also compromised in many ways, because the SSL handshake process introduced many asynchronous calls where before the code was just synchronous and so forth.
  2. Because of "1", actually even when the patch is not compiled inside, the resulting Redis source code was modified in several places in order to make everything compatible with the SSL ifdef semantics. We want to be really sure that nothing is affected when SSL is compiled out.
  3. When SSL is enabled instead, the mechanics of many important operations like replication were modified in non-trivial ways. Because historically this code was fragile to changes, and relies on many non-obvious things, we are not sure that the SSL implementation is correct in all the corner cases, and we would like to review it in a more in-depth way. This in-depth review is not possible right now while we are preparing to ship 5.0, but will be possible in the next weeks.
  4. We are both a bit ignorant about SSL itself, so we are struggling to understand certain things. For instance, why Cluster supports both an IP based and a hostname based endpoint system is not clear: maybe we should support just one, or maybe this is the only solution. Because of our ignorance we are not able to evaluate certain things properly, however our ignorance does not justify blindly merging the PR.

So we are going to continue to review and make comments here in the days following the RC1 release (we believe in 1 or 2 weeks), in order to get nearer and nearer to the merge. However we also wonder if it is possible to have a conversation about the alternatives. For instance, the same SSL implementation that works as a thread that proxies Redis connections via an UNIX socket would allow perfect separation of the code base. Is such approach so detrimental to be impossible to use, or is actually viable? And so forth. Or, as a middle ground, maybe it is possible to limit the places where the code is ifdeffed, and instead leave the fields and have a global ssl_enabled parameter, and higher level functions that do different things based on the fact SSL is used or not, and so forth. Probably the proposed approach in this PR is the right one, but we want to understand the alternatives a little better: if SSL is going to happen in Redis in the next weeks, we would like it to happen in the best possible way. Certain effects of the PR in the source code are a bit discouraging.

Thank you again for all this work, we'll write more comments in this PR soon.

@artix75 and @antirez

@ccs018

ccs018 commented May 10, 2018

Quick question, we understand that the nodes now have a hostname endpoint attribute which is not an IP like in the non-SSL Redis Cluster. Is this something specific to SSL, which in order to work must be associated with a DNS name that can be resolved, or is it still technically possible to use bare IP addresses? Thanks.

One feature of SSL is being able to perform hostname verifications. The FQDN of the server is in the certificate that it presents during the SSL handshake. As a client, I want to verify that the certificate being presented is for the server I'm trying to connect with.

Also, moving to using FQDNs - especially in the cluster nodes.conf file - is useful in a container environment. This is because if a Redis container restarts it could very well be assigned a new IP address. My understanding is that this will cause issues. Assuming static IP addresses in most any application is not a good thing.
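
To make the hostname verification idea concrete, here is a deliberately simplified sketch of comparing the name found in a certificate against the host the client meant to reach. It is an illustration only: `hostnameMatches` is a hypothetical helper, it handles just an exact match and a single left-most `*.` wildcard, and production code should rely on the TLS library's own verification rather than a hand-written check like this.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Hypothetical check: returns 1 when 'cert_name' (the name presented in
 * the certificate) matches 'host' (the name the client connected to).
 * DNS names compare case-insensitively. A leading "*." in the
 * certificate name stands for exactly one non-empty left-most label. */
int hostnameMatches(const char *cert_name, const char *host) {
    assert(cert_name != NULL && host != NULL);
    if (strncmp(cert_name, "*.", 2) == 0) {
        const char *dot = strchr(host, '.');
        if (dot == NULL || dot == host) return 0;
        /* Compare ".example.com" against the host's remainder. */
        return strcasecmp(cert_name + 1, dot) == 0;
    }
    return strcasecmp(cert_name, host) == 0;
}
```

Note that a bare IP address such as 10.0.0.5 never matches a certificate issued for a DNS name under this check, which is one reason a hostname endpoint is useful alongside the IP.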

@madolson

Contributor

madolson commented May 24, 2018

Hiya, thanks for the feedback, and sorry for taking so long to get back around to this.

Stuff that was fixed in the latest commit

  • Everything builds on MacOS now.
  • All the clang stuff is fixed
  • connection cleanup was refactored so it is in ssl.c, seemed cleaner.
  • Pegged to a specific commit of s2n per the RedisConf discussion. I would consider this a stopgap because they say they are going to release 1.0 soon, and then I'll pull that release.

Stuff that isn't quite fixed yet

  • SSL tests don't run on MacOS. Having trouble getting any ssl client to work for tcl, will keep trying.
  • Sentinel is still not fixed
  • Redis trib is not 100% ssl compliant yet

Stuff that was left the same:

  • clusterClientSetup still has the same behavior as the original pull request. The ping value is reset because the connection just came off of performing an SSL handshake, so the assumption is that it just exchanged a lot of data successfully and so it is ok to reset the ping.
  • aeGetFileProc still has the same behavior. You are required to pass in AE_WRITABLE or AE_READABLE or it will always return NULL. Could add more checking around that.

I think that covers the minor stuff, to the bigger questions.

For instance, the same SSL implementation that works as a thread that proxies Redis connections via an UNIX socket would allow perfect separation of the code base. Is such approach so detrimental to be impossible to use, or is actually viable?

That solution will work, since it is effectively doing the same work that SSL proxies are doing today. It adds additional buffers and syscalls which will impact latency. It could be better for throughput if the process is bottlenecked on CPU and if there are free cores available on the machine.

Or, as a middle ground, maybe it is possible to limit the places where the code is ifdeffed, and instead leave the fields and have a global ssl_enabled parameter, and higher level functions that do different things based on the fact SSL is used or not, and so forth. 

I think this is a good compromise. When submitting this pull request, we had a bit of discussion on how much/where to ifdef. There are probably only a couple of places where ifdefs are strictly required if we made the cluster bus change: wrapping ssl.c/h since it contains s2n calls, the server macros because they reference ssl.c/h, and preventing enable-ssl if it's not built. Everything else could check ssl at runtime. We were hoping to limit the impact in this request, and then peel off the ifdefs as we gain confidence in the implementation. We are running similar code today in ElastiCache, but there are more use cases than we currently allow and we want to make sure it's robust.

I think @ccs018 answered the FQDN question well. Let me know if I missed something, and thanks again for considering this request :D
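
The runtime-check compromise discussed above might be sketched as follows: the fields are always present, and a single higher-level read function picks the plain or SSL path from a flag instead of an #ifdef at every call site. All names here (`ioConfig`, `connRead`, the fake readers) are hypothetical and stand in for whatever the real read paths would be.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>  /* ssize_t (POSIX) */

typedef ssize_t ioReadFn(int fd, void *buf, size_t len);

/* Hypothetical config: present in every build, SSL or not. */
typedef struct ioConfig {
    int ssl_enabled;       /* runtime switch */
    ioReadFn *plain_read;  /* always available */
    ioReadFn *ssl_read;    /* would stay NULL when SSL is not built */
} ioConfig;

/* Higher-level read: callers never need to know which path is taken. */
ssize_t connRead(const ioConfig *cfg, int fd, void *buf, size_t len) {
    if (cfg->ssl_enabled) {
        assert(cfg->ssl_read != NULL);  /* enable-ssl requires an SSL build */
        return cfg->ssl_read(fd, buf, len);
    }
    return cfg->plain_read(fd, buf, len);
}

/* Fake readers that only mark which path ran; they ignore the fd. */
ssize_t fakePlainRead(int fd, void *buf, size_t len) {
    size_t n = len < 5 ? len : 5;
    (void)fd;
    memcpy(buf, "plain", n);
    return (ssize_t)n;
}
ssize_t fakeSslRead(int fd, void *buf, size_t len) {
    size_t n = len < 5 ? len : 5;
    (void)fd;
    memcpy(buf, "tls!!", n);
    return (ssize_t)n;
}
```

Under this arrangement the only compile-time decision left is whether an SSL reader gets installed at all, which matches the short list of strictly required ifdefs mentioned above.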

cd openssl && $(SSL_CONFIG)
cd openssl && $(MAKE) depend
cd openssl && $(MAKE)
cd openssl && $(MAKE) install

@josiahcarlson

josiahcarlson Jun 2, 2018

This can overwrite / blow away system libraries and docs. May want to remove this line.

lcharkiewicz added a commit to lcharkiewicz/telegraf that referenced this pull request Jun 5, 2018

Add SSL/TLS support to Redis input
SSL/TLS support is added to make Redis input working with services like
AWS Elasticache with Encryption-in-transit turned on, Redis with stunnel
or Redis with native SSL support (antirez/redis#4855).

@lcharkiewicz lcharkiewicz referenced this pull request Jun 5, 2018

Merged

Add SSL/TLS support to Redis input #4236

2 of 3 tasks complete

@dimakuv dimakuv referenced this pull request Aug 10, 2018

Open

SSL/TLS support #53

@theromis

theromis commented Oct 6, 2018

Super excited about this potential functionality in upcoming Redis versions. May I ask what state this PR is in right now?

@madolson

Contributor

madolson commented Oct 8, 2018

Last I talked to antirez, he said he would review it in more depth for Redis 6.0 since the changes are non-trivial. I'll update the PR whenever 5.0 becomes GA and antirez can give an estimate for when he'll be able to review it.

The big open question is the performance impact of Antirez's proposed solution of making it more of a proxy layer with unix sockets instead of deeply embedded implementation.

@theromis

theromis commented Oct 8, 2018

@madolson Great, waiting for updates in Redis 6.0
