Worker Stability #67
Hi there, thanks for reporting this. I'll see if I can reproduce it. Could you also post any relevant logs from motley-cue?
Hey, here are all logs for a failed login attempt. It seems that the user is checked and everything, but this information is not passed along correctly, resulting in a timeout: feudal.log
motley_cue.gunicorn.error
mapper.log (I removed the access token and the user info)
Thanks! Do you also have an error in the nginx logs? It could be an error on the nginx side.
In the nginx logs I indeed see something like this:
OK. I would suggest first increasing the various timeout settings to see whether the problem persists. For nginx, you could raise the relevant proxy timeout settings and then restart nginx. Similarly, for gunicorn, you could increase the worker timeout. It would also help to know where exactly the timeout happens.
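For illustration, here is a minimal sketch of the kind of settings meant, assuming gunicorn is configured through a Python config file; the file name, the values, and the exact nginx directives in a motley-cue deployment may differ:

```python
# gunicorn.conf.py -- assumed location; adjust to your deployment
workers = 20           # as in the reported setup
timeout = 300          # seconds a worker may stay silent before it is killed and restarted
graceful_timeout = 30  # grace period for workers to finish requests during a restart

# The corresponding nginx proxy timeouts would be raised in the site configuration, e.g.:
#   proxy_read_timeout 300s;
#   proxy_connect_timeout 75s;
```

Both nginx and the gunicorn service need to be restarted (or reloaded) for the new values to take effect.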
I just ran a test again with the increased timeouts and I am still getting the same error. These are the logs; essentially, nothing is logged for the duration of the timeout, only at the very beginning and after the timeout has passed. In the feudal.log:
gunicorn
nginx
I just checked: in a successful attempt, the ldap backend returns a statement that the user exists, and then the account is allowed to log in.
Could this be the cause of the timeout?
Indeed, I think in the first case the ldap search operation times out, and the second time it succeeds, probably due to caching. I'll need to dig into the feudal-adapter code and see how the performance of ldap queries can be optimised.
Do you have any logs from your LDAP server, e.g. which operation times out, as well as some information on how the ldap is configured?
Hi, I do not have access to the LDAP server; it's run by KIT, one of the fels LDAP servers. From the sssd logs, I cannot see anything out of the ordinary, but I guess the ldap search operations are done directly via Python, not via sssd. Concerning the configuration on the client side, see below the client configurations for sssd and feudal.
I had a closer look into this issue today and found the cause as well as a solution. The problem is that the LDAP connections are opened at the startup of motley-cue (or the feudal part of it) [1]. Since the LDAP instance that we are using is behind a firewall and might also be restarted at some point, there is a high chance that the connection that was opened at the startup of the worker is no longer available. To avoid this effect, I would propose adding an additional parameter (something like a refresh interval for the connection); maybe there is also a way within the LDAP package to check whether the connection is no longer working, which could be an even nicer solution to this problem.
Thanks for tracking down the issue! Looking at the ldap3 docs, using SAFE_RESTARTABLE when initiating the connection might fix this.
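For context, this is roughly what that suggestion would look like with ldap3; the host, bind DN and password below are placeholders, not the feudal-adapter configuration:

```python
from ldap3 import Server, Connection, SAFE_RESTARTABLE

server = Server("ldaps://ldap.example.org")   # placeholder host
conn = Connection(
    server,
    user="cn=admin,dc=example,dc=org",        # placeholder bind DN
    password="secret",                        # placeholder password
    client_strategy=SAFE_RESTARTABLE,         # thread-safe strategy that tries to restart the connection on failures
    auto_bind=True,
)
# With the "safe" strategies, every operation returns a (status, result, response, request) tuple.
status, result, response, _ = conn.search("dc=example,dc=org", "(objectClass=posixAccount)")
```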
I had also tried this, as well as using a ConnectionPool with multiple connections, but both approaches did not result in a successful fix. I think the problem is that the client does not get any information that the connection was closed, and thus it is not restarted or handled in any way.
For documentation, this is the modification that we have now been using for nearly two weeks without any problems (with some additional log information):
```diff
index e851bbe..791a2d1 100644
--- a/ldf_adapter/backend/ldap.py
+++ b/ldf_adapter/backend/ldap.py
@@ -7,6 +7,8 @@ It's in the proof-of-concept state.
 # pylint: disable=raise-missing-from, missing-docstring, too-few-public-methods
 import logging
+import time
+import threading
 from ldap3 import (
     AUTO_BIND_NO_TLS,
     SAFE_RESTARTABLE,
@@ -16,6 +18,7 @@ from ldap3 import (
     MODIFY_REPLACE,
     MODIFY_DELETE,
     MODIFY_ADD,
+    SAFE_RESTARTABLE,
     SAFE_SYNC,
     LDIF,
 )
@@ -134,7 +137,13 @@ class LdapConnection:
         self.uid_max = CONFIG.backend.ldap.uid_max
         self.gid_min = CONFIG.backend.ldap.gid_min
         self.gid_max = CONFIG.backend.ldap.gid_max
+        self.refresh_interval = 300
+        self.stop_refresh = threading.Event()
+        threading.Thread(target=self.refresh_connection, daemon=True).start()
+        self.connect()
+
+    def connect(self):
         # initialise connection used to generate LDIFs
         self.ldif_connection = Connection(server=None, client_strategy=LDIF)
         self.ldif_connection.bind()
@@ -165,6 +174,17 @@ class LdapConnection:
             logger.error(f"{msg}: {e}")
             raise Failure(message=msg)
+
+
+    def refresh_connection(self):
+        while not self.stop_refresh.is_set():
+            try:
+                logger.info("Refreshing LDAP connection...")
+                self.connect()
+            except Exception as e:
+                logger.error(f"Error refreshing LDAP connection: {e}")
+            time.sleep(self.refresh_interval)
+
     def init_nextuidgid(self):
         """Initialise uidNext and gidNext entries in FULL_ACCESS mode
         with values starting in configured range.
@@ -204,7 +224,17 @@ class LdapConnection:
             raise Failure(message=msg)
     def search_user_by_oidc_uid(self, oidc_uid, attributes=[]):
-        return LdapSearchResult(
+        logger.debug(f"Searching user by oidc_uid: {oidc_uid}")
+        query = [
+            f"{self.user_base}",
+            f"(&({self.attr_oidc_uid}={oidc_uid})(objectClass=inetOrgPerson)(objectClass=posixAccount))",
+        ]
+        logger.debug(f"Query: {query}")
+        logger.debug(f"Attributes: {attributes}")
+        logger.debug(f"self.user_base: {self.user_base}")
+        logger.debug(f"Starting timer...")
+        timer = time.time()
+        result = LdapSearchResult(
             self.connection,
             [
                 f"{self.user_base}",
@@ -212,6 +242,8 @@ class LdapConnection:
             ],
             {"attributes": attributes},
         )
+        logger.debug(f"Search took {time.time() - timer} seconds.")
+        return result
     def search_user_by_local_username(self, username, get_unique_id=False):
         return LdapSearchResult(
@@ -646,7 +678,9 @@ class User(generic.User):
         If this returns True, calling `create` should have no effect or raise an error.
         """
         logger.info(f"Check if user exists: {self.unique_id}")
-        return LDAP.search_user_by_oidc_uid(self.unique_id, attributes=[]).found()
+        result = LDAP.search_user_by_oidc_uid(self.unique_id, attributes=[]).found()
+        logger.info(f"User {self.unique_id} exists: {result}")
+        return result
     def name_taken(self, name):
         """Return whether the username is already taken by another user on the service,
```
Thanks! It's great that you found a solution that works for you. I tested it and it doesn't seem to have any issues. I did implement another solution that makes sure the connection is open before every operation, and it seems to work well in my tests. You can try it in the prerelease repository (latest version: …).
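For readers interested in the gist of that approach, here is a rough sketch of checking an ldap3 connection before use; `ensure_connection` is a hypothetical helper, not the actual feudal-adapter code:

```python
from ldap3 import Connection

def ensure_connection(conn: Connection) -> Connection:
    """Hypothetical helper, called before every LDAP operation."""
    if conn.closed:
        conn.open()   # re-open the socket if the server or a firewall dropped it
    if not conn.bound:
        conn.bind()   # re-bind using the credentials stored on the connection
    return conn
```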
Hello,
I have an odd error that I am not sure how to debug without going deep into the code of the tool. We have a working setup of motley-cue, but I notice that it happens quite often that a user login runs into a timeout. In the logs at debug level, I only observe that a worker (probably the one assigned to handling this specific user login) is terminated without any additional information:
gunicorn error log
I can retry the login a second time, and it works as intended. Once the login has worked, it also keeps working consistently, which makes it a bit hard to debug this issue. We are using the SQLite backend for the mapping of OTPs, and gunicorn is running with 20 workers. Below, I have attached some more logs of the problem.
Is this something that is known to you and points to a configuration error on our side?
Best Wishes
Sebastian
mccli log from a failed login
mccli log from a successful login
Motley Version