
Q: Fault tolerant LDAP connection? #1548

Open · Martin-Weiss opened this issue Sep 19, 2019 · 17 comments

@Martin-Weiss

Is it possible to configure dex so that it connects to one LDAP server and, if that server is not reachable, fails over to a second LDAP server?

For example, in Active Directory environments failover is required whenever a domain controller is updated or rebooted; during that time the LDAP/AD clients need to fail over to a second controller that provides the same data, for fault tolerance and scalability.

@bonifaido (Member)

Have you considered using a TCP load balancer for this?

@Martin-Weiss (Author)

> Have you considered using a TCP load balancer for this?

Yes, but a load balancer adds extra complexity to the architecture, and many on-premise deployments simply do not have load balancers between the k8s network and the Active Directory domain controller network. LDAP fault tolerance is also similar to DNS fault tolerance: handling the failure is on the client side by design.
For AD connectivity it might even be necessary to honor the Sites and Services configuration for fault-tolerant LDAP connectivity.

Do you have an idea of what will break or stop working if Dex cannot reach the specified LDAP server?

Is there identity and group caching in Dex, so that RBAC authentication / authorization keeps working even while LDAP is not reachable?

Do you have an idea of how to run a load balancer within the k8s deployment, so we could configure Dex -> k8s load balancer -> Active Directory?

@bonifaido (Member)

The current LDAP Go client in Dex is not capable of handling such scenarios, so we would have to implement some custom reconnection logic.

There is no such cache in the LDAP connector; if the server is unreachable, it is not possible to log in.

For a load balancer, I would put an HAProxy or NGINX sidecar next to Dex.

@geruetzel

I agree with @Martin-Weiss. We all know how to load-balance services, but such a setup is not possible in every scenario. It would be really great if dex supported multiple LDAP servers per connector.

@Martin-Weiss (Author)

In the meantime I did some research and found some ideas to work around this problem, but none of them seems production ready or fulfills all requirements, e.g. regarding LDAP target health checks.

Basically we could:
a) configure a service with custom outgoing endpoints
-> easy to configure, but no LDAP health checks (see the sketch after this list)
b) run a sidecar load balancer for outgoing connections
-> complicated to configure, monitor and manage
c) use Istio or similar for the outgoing communication (uses a sidecar, if I understand this correctly)
-> even more complicated to configure, monitor and manage
d) add an additional LDAP proxy / load balancer just for 636 / LDAP
-> complicated to configure, monitor and manage, requires additional hardware, and is often not possible due to network and load balancer connectivity
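
For option a), roughly something like the following minimal sketch: a Kubernetes Service without a selector plus manually maintained Endpoints, with Dex pointing at the stable in-cluster name. All names and IPs below are placeholders, and the domain controllers' certificates would also have to be valid for the name Dex connects to:

```yaml
# Service without a selector, fronting two external domain controllers
apiVersion: v1
kind: Service
metadata:
  name: ad-ldaps          # placeholder name
  namespace: dex
spec:
  ports:
  - name: ldaps
    port: 636
    targetPort: 636
---
# Manually maintained Endpoints; must have the same name as the Service
apiVersion: v1
kind: Endpoints
metadata:
  name: ad-ldaps
  namespace: dex
subsets:
- addresses:
  - ip: 10.0.0.11         # DC1 (placeholder)
  - ip: 10.0.0.12         # DC2 (placeholder)
  ports:
  - name: ldaps
    port: 636
```

Dex would then use something like host: ad-ldaps.dex.svc:636, but nothing verifies whether the LDAP service behind each IP actually answers.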

While a) seems to be the simplest, all of the above can only do a TCP check, which unfortunately is not sufficient for LDAP fault tolerance: the LDAP port may be open while the LDAP server behind it does not respond as expected (an application-layer failure).

A health check for the LDAP connection would need to do an LDAP bind (to check whether LDAP communication works at all) and an LDAP search (to check whether a result comes back or an error occurs), and fail over if bind or search fails or times out against the first configured LDAP server. All of this also needs to happen over SSL, which makes load balancing, health checks and failover even more difficult.
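
For illustration, a minimal sketch of such a bind + search health check in Go, assuming the github.com/go-ldap/ldap/v3 client; host, DNs and credentials are placeholders:

```go
// Minimal sketch of an application-layer LDAP health check (bind + search over TLS).
package main

import (
	"fmt"
	"log"
	"time"

	ldap "github.com/go-ldap/ldap/v3"
)

func checkLDAP(url, bindDN, bindPW, baseDN string) error {
	conn, err := ldap.DialURL(url) // e.g. "ldaps://dc1.example.com:636"
	if err != nil {
		return fmt.Errorf("dial: %w", err)
	}
	defer conn.Close()
	conn.SetTimeout(5 * time.Second)

	// 1. Bind: proves the server speaks LDAP and accepts the service account.
	if err := conn.Bind(bindDN, bindPW); err != nil {
		return fmt.Errorf("bind: %w", err)
	}

	// 2. Search: proves the directory actually answers queries.
	req := ldap.NewSearchRequest(baseDN, ldap.ScopeBaseObject, ldap.NeverDerefAliases,
		1, 5, false, "(objectClass=*)", []string{"dn"}, nil)
	if _, err := conn.Search(req); err != nil {
		return fmt.Errorf("search: %w", err)
	}
	return nil
}

func main() {
	err := checkLDAP("ldaps://dc1.example.com:636",
		"cn=svc,ou=accounts,dc=example,dc=com", "secret",
		"dc=example,dc=com")
	if err != nil {
		log.Fatalf("LDAP health check failed: %v", err)
	}
	fmt.Println("LDAP health check OK")
}
```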

So is there any chance to enhance the LDAP client so that multiple LDAP servers can be configured, with a failover mechanism that uses the second / third one if the first one for a given directory does not respond or returns an error during bind/search? It would then stay on the second one until that fails, and so on, with a final error (perhaps after some retries) if none of the configured LDAP servers works as expected.

@phiremande

I am new to this code base, but from an initial look at the code I think this request can be satisfied by implementing a new connector called 'ldap_cluster' (or similar) which reads its configuration into an array of config items. I can give it a try if you think this is acceptable.

@bonifaido (Member)

I have also checked the LDAP client; since it can't handle multiple addresses, I agree that we would need something like what @phiremande suggests. I'm okay with such a change!

@phiremande

Thanks @bonifaido. I will try to come up with the changes needed.

@jenting commented Nov 29, 2019

A workaround is to add multiple LDAP connectors by config.
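
For illustration, a minimal sketch of that workaround in the dex config: two LDAP connectors that only differ in id, name and host. All hosts, DNs and attributes below are placeholders; see the dex LDAP connector documentation for the exact fields in your version:

```yaml
connectors:
- type: ldap
  id: ad-dc1
  name: "Active Directory (DC1)"
  config:
    host: dc1.example.com:636
    rootCA: /etc/dex/ad-ca.pem
    bindDN: cn=dex-svc,ou=accounts,dc=example,dc=com
    bindPW: changeme
    userSearch:
      baseDN: ou=users,dc=example,dc=com
      filter: "(objectClass=person)"
      username: sAMAccountName
      idAttr: DN
      emailAttr: mail
      nameAttr: cn
    groupSearch:
      baseDN: ou=groups,dc=example,dc=com
      filter: "(objectClass=group)"
      userAttr: DN
      groupAttr: member
      nameAttr: cn
- type: ldap
  id: ad-dc2
  name: "Active Directory (DC2)"
  config:
    host: dc2.example.com:636
    # ...same bind and search settings as ad-dc1...
```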

@Martin-Weiss (Author)

> A workaround is to add multiple LDAP connectors by config.

I assume you mean configuring two identical connectors and giving them different IDs, so that each user and group is available via two or three connectors?
Have you tested this?
How long does failover take with such a duplicate config if the first connector does not work?
Does this also work well with OIDC / gangway?

@geruetzel

> A workaround is to add multiple LDAP connectors by config.
>
> I assume you mean configuring two identical connectors and giving them different IDs, so that each user and group is available via two or three connectors?
> Have you tested this?
> How long does failover take with such a duplicate config if the first connector does not work?
> Does this also work well with OIDC / gangway?

This does NOT work, because you are presented with the option to click and choose a connector when logging in. If one of the connector backends is not available, that changes nothing: you still have to choose which connector to use during login, and if the first one does not work you would have to click the second connector manually. So this is NOT a solution to the given problem.

@phiremande

@Martin-Weiss, I have been making slow progress on this. Please see https://github.com/phiremande/dex/tree/feature-ldapcluster and check https://github.com/phiremande/dex/blob/feature-ldapcluster/examples/config-ldapcluster.yaml for an example configuration.
I have done some basic testing of Login/Refresh with 2 LDAP servers in the cluster; I have yet to write unit tests.
If you wish to, please clone/compile it, try it out and let me know your initial feedback.

@Martin-Weiss (Author)

> @Martin-Weiss, I have been making slow progress on this. Please see https://github.com/phiremande/dex/tree/feature-ldapcluster and check https://github.com/phiremande/dex/blob/feature-ldapcluster/examples/config-ldapcluster.yaml for an example configuration.
> I have done some basic testing of Login/Refresh with 2 LDAP servers in the cluster; I have yet to write unit tests.
> If you wish to, please clone/compile it, try it out and let me know your initial feedback.

@phiremande , wow - great to see you could find some time to work on this!

I am not a developer, so I do not understand the code in detail, but I can see that we are able to specify multiple LDAP servers within a cluster with separate filters and bind configurations - great!! :-)

Could you give some background on how the logic for connecting and failover is built?
When and how does failover from one LDAP server to the other happen, and how is it verified that one or the other LDAP server is reachable and working properly?
Does the LDAP client stay with the newly selected LDAP server, or does it fall back again for the next request? (It might slow down all requests if the first one is not reachable.)

Basically I would assume that we would use the same filters against all LDAP servers in an LDAP cluster, as they should have identical content (e.g. Active Directory domain controllers that are replicated). So having different filters might not be required.

For failover, it would be nice if we could switch between the configured LDAP servers when one gives an error during bind or search, but not fail over if the search merely returns an empty result. And IMO we should not fail back before another error happens.
Perhaps we should have a config option for failover and failback, or even round-robin for each request?

I am also not sure whether we might need configurable timeouts for the LDAP connect and failover, or some retries.

Again - thanks for the great progress and step forward :-).

@phiremande commented Dec 9, 2019

@Martin-Weiss, thanks for your response. I am not expecting code review input at this stage, just feedback on how it functions based on the initial code (if you build and run the dex binary with multiple LDAP servers). I probably should have given the background on the design when I asked for input, sorry for that. Below is the design as currently incorporated.

  1. During login (bind), the first time, all servers are tried in round-robin order.
  2. Once a bind succeeds against a server, that server is marked "active".
  3. Subsequent logins/refreshes keep using the "active" server (bind/search go against this server) until a bind error occurs.
  4. During step 3, if the active server fails, all servers are tried again in round-robin fashion.
  5. Only bind is tried in round-robin (i.e. during Login/Refresh). A search following this bind is tried against the server that is deemed active.
  6. The user groups' search is likewise tried only against the active server.

So essentially, bind is round-robin and any subsequent search goes only to the active server (the server against which bind succeeded).
This connector (like the existing connectors) exposes only Login/Refresh, since from an end-user perspective they are either trying to log in (to get a token) or a token refresh is being done. Hence I believe round-robin for bind is what we need. A search happens immediately after the bind, so a failure during search is not retried against all servers.
Let me know if you have other thoughts. @bonifaido I would appreciate your input too, since you have looked at the code and probably know how the connectors work.
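
For illustration, a rough Go sketch of the round-robin bind with a sticky "active" server as described above; the types and names are hypothetical and not taken from the linked branch:

```go
// Rough sketch of round-robin bind with a sticky "active" server.
package main

import (
	"fmt"

	ldap "github.com/go-ldap/ldap/v3"
)

type ldapCluster struct {
	urls   []string // e.g. []string{"ldaps://dc1:636", "ldaps://dc2:636"}
	active int      // index of the server the last bind succeeded against
}

// bindAny tries the currently active server first, then the remaining servers
// in round-robin order. On success it records the server as active and
// returns the open, bound connection.
// A real implementation would distinguish invalid credentials from
// connection/server errors and only fail over on the latter.
func (c *ldapCluster) bindAny(bindDN, bindPW string) (*ldap.Conn, error) {
	var lastErr error
	for i := 0; i < len(c.urls); i++ {
		idx := (c.active + i) % len(c.urls)
		conn, err := ldap.DialURL(c.urls[idx])
		if err != nil {
			lastErr = err
			continue
		}
		if err := conn.Bind(bindDN, bindPW); err != nil {
			conn.Close()
			lastErr = err
			continue
		}
		c.active = idx // stay on this server until it fails
		return conn, nil
	}
	return nil, fmt.Errorf("all LDAP servers failed, last error: %w", lastErr)
}

func main() {
	c := &ldapCluster{urls: []string{"ldaps://dc1.example.com:636", "ldaps://dc2.example.com:636"}}
	conn, err := c.bindAny("cn=svc,dc=example,dc=com", "secret")
	if err != nil {
		fmt.Println("login failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("bound to", c.urls[c.active])
}
```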

@Martin-Weiss (Author)

Thanks a lot for the details - so yes - this sounds like what we need!
We will not get load distribution, but that is not required anyway, as the expected load is not high (just login, and after ~24h the token refresh).
I am not sure how often the group membership is updated - does this also happen only during login and token refresh?
Regarding the round robin: basically I understand that we use the first server as long as it works. If the first one does not work (bind error), we take the second one, and when the second one fails we fall back to the first one or move on to the third one in the list?

I believe we should have an ordered list, so the first server is used whenever possible, and we might also need a fallback after some time. Reason: with LDAP/Active Directory we might have a remote LDAP server and a local LDAP server, and we should always use the local one if possible - only using the central / remote one if the local one is not available.
So failover to the central server seems to work with the current scenario; it is just the fallback to the local one that seems to require a WAN outage or a central LDAP failure.
Maybe we can add a fallback timer or similar?
Other than this - thanks a lot!! I am really looking forward to getting this feature into production ;-)

@Fixmetal commented Dec 17, 2020

I just bumped into this issue. Is there any intention to implement this in a near-future release?

@alexei-matveev commented Mar 14, 2023

I think this might be relevant in a multi-site AD environment:

https://ldap.com/dns-srv-records-for-ldap/

I also noticed that DialTLS() is marked deprecated (in the comments) in the go-ldap library in favor of DialURL():

https://github.com/go-ldap/ldap/blob/master/v3/conn.go#L198

I do not immediately see whether DialURL() offers any advantages for "fault tolerance", but a similar issue is open in the go-ldap repo too:

go-ldap/ldap#314
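
To illustrate the SRV idea, a small Go sketch, assuming the github.com/go-ldap/ldap/v3 client: resolve _ldap._tcp.<domain> and dial the first target that answers. The domain and the plain ldap:// scheme are placeholders; AD deployments would typically use LDAPS instead:

```go
// Sketch: resolve the DNS SRV record for LDAP and dial the first reachable host.
package main

import (
	"fmt"
	"net"

	ldap "github.com/go-ldap/ldap/v3"
)

func dialFromSRV(domain string) (*ldap.Conn, error) {
	// _ldap._tcp.<domain>, as described at https://ldap.com/dns-srv-records-for-ldap/
	_, records, err := net.LookupSRV("ldap", "tcp", domain)
	if err != nil {
		return nil, fmt.Errorf("SRV lookup: %w", err)
	}
	var lastErr error
	for _, srv := range records { // LookupSRV returns records sorted by priority, randomized by weight
		url := fmt.Sprintf("ldap://%s:%d", srv.Target, srv.Port)
		conn, err := ldap.DialURL(url)
		if err == nil {
			return conn, nil
		}
		lastErr = err
	}
	return nil, fmt.Errorf("no SRV target reachable: %w", lastErr)
}

func main() {
	conn, err := dialFromSRV("example.com")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected")
}
```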
