New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle inflated numSubordinates in profile & LWCA subtrees #188
Handle inflated numSubordinates in profile & LWCA subtrees #188
Conversation
During initial profile loading, if we receive fewer entries than indicated by the parent entry's numSubordinates attribute, the AsyncLoader will not unlock, and the Dogtag startup thread is blocked. This situation can arise when there are entries that are contributing to the numSubordinates count, which are not visible to Dogtag. Replication conflicts are one such example. The implementation currently uses a persistent search that also returns existing entries. The alternative approach - a regular search followed by a persistent search - leaves open the possibility of missing replicated changes to the subtree that were processed in between the regular and persistent search. Therefore we use a single search, which avoids this possibility. We also *do* want to block startup until all profiles are loaded. The system reporting ready before profiles are loaded has led to issues in CI and production environments. During a persistent search, there is no in-band signal that indicates when all the "immediate" results have been delivered. The solution was to read the numSubordinates value of the container to know how many immediate results to process. So we have to address the corner cases discussed above. The approach to resolving this is to use a watchdog timer during initial load of profiles. The AsyncLoader is now initialised with a timeout value (in seconds). A timer is started and the lock is forcibly released after the timeout. A value <= 0 suppresses the watchdog. Update the LDAPProfileSubsystem to time out the loader after 10 seconds. The existing behaviour of unlocking when the expected number of entries have been processed is maintained. Also add a log message when the start await gets interrupted, to indicate that there may be replication conflicts or other extraneous data in the profile configuration subtree. Fixes: https://pagure.io/dogtagpki/issue/3078
Similar to the work done for LDAPProfileSubsystem, to avoid hanging startup when the number of entries processed during initial load of LWCAs is less than suggested by the numSubordinates attribute of the container entry (replication conflict entries can cause this). Switch the authority monitor to use AsyncLoader which provides the watchdog timer, and takes care of some of the existing logic. Also add a log message when the startup await gets interrupted, to indicate that there may be replication conflicts or other extraneous data in the LWCA subtree. Related: https://pagure.io/dogtagpki/issue/3078
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm definitely not the best person to review this... The one thought I had was if a hard 10 second timeout is appropriate here.
I think the answer is yes, because at this point the LDAP should already be confirmed to be up and working. We're trying to load a number of entries, so even if the connection is really slow (e.g., huge LDAP db?), that still shouldn't hit our timeout limit. Making it configurable might help in some rare instances, but I'm not sure if we want to (or even if we have a reference to the configuration here).
Thoughts?
The code around this area is a bit complicated, so I'll review based on the description. I trust you about the code :) Would it be safe to resume PKI server startup (after timeout) knowing that there's a conflict in the DS? When a DS replication conflict happens (which will manifest as incorrect numSubordinates), is it something that DS will be able to detect and resolve automatically, or does it require manual intervention? If DS can resolve it automatically, would the startup code become unblocked eventually (so no need for timeout)? Or is there a mechanism for DS to notify PKI server about the conflict resolution so it can interrupt the current persistent search and start a new one? If it requires manual intervention, should PKI server actually terminate such that it doesn't provide incorrect information and/or cause further damage? |
@edewata good questions :) I will answer one at a time.
We don't know for sure that this is due to conflict entries - it could be caused by other entries we haven't seen. But conflicts is the most common case. I would say it's OK to proceed because nothing else currently detects and halts on detection of conflict entries. i.e. it is sufficient to warn that there may be a conflict, then move on. Other projects (particularly IPA Health Check) are going to be responsible for detecting and alerting about DS conflicts.
AFAIK DS does not have automatic remediation.
Not applicable.
We are not detecting conflict entries in general, only in one subtree (and, again, it might not actually be a conflict entry at all - the whole thing about inflated |
OK, just a note that standalone PKI installations will not have IPA health check to detect such issue. Regardless, DS replication conflict is outside the scope of PKI so feel free to check it in if you think the risk is manageable. |
Right, I meant to mention earlier but forgot: standalone PKI installations use file-based profiles by default, and LWCAs are only supported in IPA context, so yeah I think it is low-risk :) Thanks for the review @edewata. |
The LDAP profile and LWCA monitor threads use the numSubordinates
attribute of the container object to determine how many entries it
expects to process. After processing that many entries, the monitor
thread releases the startup lock and startup continues.
If numSubordinates is greater than the number of entries the monitor
sees, startup hangs. This can occur e.g. when there are replication
conflict entries in the subtree (these do not appear in search
results, but still affect numSubordinates).
Handle this scenario by setting a watchdog timer and forcibly
unlocking the startup lock if it expires. This also gives us the
opportunity to log a warning that there may be conflict entries or
other extraneous entries in the subtree.
Fixes: https://pagure.io/dogtagpki/issue/3078
Changes: