-
Notifications
You must be signed in to change notification settings - Fork 23.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sentinel: Fix initial Hello source address #1908
Conversation
I previously added source binding to sentinelReconnectInstance(), but I missed adding it to the Hello flow. Thanks to @dkong for tracking down the problem in dkong@9b6eb0e
We can save a little work by aborting when we enter the function if we're disconnected.
Good idea, but the implementation has some issues (memcpy of buffer with a fixed length, but the source pointer is not guaranteed to point to REDIS_IP_STR_LEN bytes). I'm rewriting it to merge today for 2.8.14. |
No sorry I still don't fully understand the goal of this patch, because in theory, with anetSockName(), we are already extracting the address from the socket which should already be bound to the right address because of @mattsta previous patches. Please could you specify what is going on here? Thanks. I tried to read the @dkong patch but apparently it tries to do a different thing (to fix the known issue of a public address that is not among the actual local addresses). |
p.s. in the meantime I applied the "Abort Hello quicker if not connected …" commit at least. |
We have people doing this: Datacenter A (or VM or container):
External IP 1 -> NAT -> Redis Sentinel 1
Datacenter B:
External IP 2 -> NAT -> Redis Sentinel 2 Attempts at automatic IP discovery will not be correct in this situation since Redis can never see the public IP. Redis then starts to gossip around an internal IP no other hosts can connect against. We previously fixed this in reconnection logic, but not in the Hello section. So...... this patch is in the |
@mattsta the part I don't understand is how the user can bind an address which is not among the listed local addresses. In other terms, this is the proposed change:
at a first glance, if the bind address was configured, then, because of your past changes, anetSockName() will already return this bound address. Otherwise if the address provided by bind is not among the local addresses of the instance, the server will not start and exit with an error. Basically the code above does not replicate (unless I'm missing something, which is probable) what the original user intended, that was the ability to announce an address which is a totally arbitrary address that the computer where Redis is running does not have any idea about. |
Two quick notes before longer message:
Linux has options allowing bind of IPs that don't exist on the machine (
Correct, if the nonlocal_bind feature isn't enabled. Binding to non-existing IPs is typically used for HA situations where one server is standby for another. Then the standby software can be pre-running and bound to the failover IP even though it doesn't exist on the standby server until the failover happens.
Looking closer, that should be what is happening. Is it possible for a socket connected with an explicit outbound At first, I thought Hello was happening before |
Hi @mattsta @antirez . I can reproduce the issue consistently with my AWS EC2 Ubuntu instances. Let me know if there's anything I can do to help debug. However, the proposed solution of using the Remote machines can connect when sentinel binds to everything:
Remote machines cannot connect when sentinel binds to its public IP:
In my patch dkong@9b6eb0e, I separated the two features into distinct config options:
That approach worked for me. Thanks. |
Yeah, looking at your setup again, specifying the external IP here won't work at all. EC2 is just doing a Public IP->Private IP NAT, so your system has to bind to the private interface. We saw something similar before at #1667, but they claim the bind fix worked for them. So, with your setup, there's no way for Sentinel to advertise the public IP because Redis can't self-discover the public IP to announce it to other instances, so that's why you added the new config directive. The only way Sentinel can announce an arbitrary IP is to add a new config directive like you did in your patch. |
Thank you both for your detailed explanation, I was not aware of this Linux option that allowed to bind non-local addresses. Between this being Linux-only, and the complexity involved (apparently it does not always work), I think it is a better pick for us to have something like:
Where Sentinel will just pretend to be that address in the HELLO messages. I'm reviewing the patch originally provided by @dkong before to write some code, news ASAP. |
Ok I'm using @dkong commit, and committing changes to this commit in order to modify a bit the code and the behavior. More info ASAP. |
Final set of commits:
The original patch was modified in the following ways:
I tested the implementation manually for some time, but there is still no unit test for this feature since it is not trivial to implement and the feature is non critical compared to other tests still missing. Merging into 2.8 and 3.0 branch. Thank you for the help! |
p.s. in order to merge to non-unstable branches I'll wait for your comments. |
hi, when are you guys planning to push to stable? this is critical for Docker instances of sentinel (and hence my application). thanks for all your hard work (in general) on Redis... it's an awesome tool, and community... I look forward to contributing some day (when I'm a better coder). |
@joshula it was already released with 2.8.15, however there is still the problem of the fact that if you spin your salves with Docker as well, they'll be listed with a wrong address in the Redis master INFO output, so Sentinel will still not work. |
I am doing that, but it doesn't seem to be a problem for me. I have 1 master and 2 slaves, each running in a separate docker container, all on different hosts. When I REDIS-CLI into the master and run INFO or ROLE, I see the correct addresses for my 2 slaves (the IP address of the Hosts machines). Also, when I look at my sentinel logs, it knows about my slaves, and has them listed at their proper addresses. It's the other sentinel addresses (in the sentinel logs) that are incorrect (listing the IP of the docker VM instead of the host machine). I am running 2.8.15, and I've attempted to start them up with a bunch of different commands (to fix this): |
@anyone |
The main issue here is people are starting to use Redis in environments where Redis wasn't designed to work. A normal deployment has all services contained and reachable in the same IP space. But now, some wacky people want to run each Redis server in a little IP island with translation units sitting between every IP internal IP space. So, instead of running: Container A -> [Reachable IP A]
[Isolated IP 1]
Container B -> [Reachable IP B]
[Isolated IP 2]
Container C -> [Reachable IP C]
[Isolated IP 3] Redis is designed to announce itself using the IP address it's using, but when Redis will need a more extensive review in how to properly define and announce IPs for Sentinel and Cluster use cases when the IP Redis is using isn't actually reachable by any other servers. The easiest way to fix the current problem is to use interface bridging so in-container IPs are directly reachable from outside the container too. |
@joshula The fix was implemented as a config option. Did you modify the sentinel config appropriately? https://github.com/antirez/redis/blob/2.8.15/sentinel.conf#L7-L8
|
So, the new sentinel announce options let sentinel use specific announced addresses, but the individual Redis servers (and Sentinel) still don't know the Redis server reachable IP addresses. One trick you can try: conrigure the We covered that a bit here: #1908 (comment) but I think I forgot to mention you should also list the internal IP after the reachable IP so Redis is still listening locally. (If that works, we don't actually need the sentinel announce options if your NAT layer port-forwards the same ports to the backend server.) |
Thanks for helping. I really appreciate it. @dkong @mattsta |
@joshula My mistake. I didn't know about the "passing config arguments via command line" feature. This command seems to work for me:
|
Similar problems with FTP and SIP behind NAT exposing internal address are solved in the kernel with iptables/nftables helper modules (e.g. FTP, SIP). So eventually there could be a nf_conntrack_redis.c module maintained separately, as it requires some work and effort to make redis and sentinel handle correctly this translation internally. |
@dkong |
I did a bunch of testing. Still doesn't work. But I identified a new problem (that I didn't know I had before). I can't connect to my running Sentinel containers (using redis-cli) at all (it's telling me connection refused). However, I know they are running, and can properly discover my master and slaves (by viewing their logs). That may be what @mattsta was trying to tell me before? They aren't listening at the proper address? Any ideas on next steps? |
@mattsta |
Nope, no idea about any of that yet. I'd recommend re-reading the Docker Docs to figure out how things work, then use the tricks we've mentioned above to make Redis do the right thing. 👹 |
After a ton of work, I ended up figuring this out. Here's to making it simple for anyone else who wants to deploy a highly available redis instance via Docker. Thanks for all the guidance. |
@joshula thanks so much for creating that image! It seems that the creating of the config file from command line arguments is broken. Check the repo comments and let me know when that's resolved 😄 |
I previously added source binding to sentinelReconnectInstance(),
but I missed adding it to the Hello flow.
All tests pass.
Thanks to @dkong for tracking down the problem in
dkong@9b6eb0e