Handle inflated numSubordinates in profile & LWCA subtrees #188

Conversation

frasertweedale (Contributor)

The LDAP profile and LWCA monitor threads use the numSubordinates
attribute of the container object to determine how many entries it
expects to process. After processing that many entries, the monitor
thread releases the startup lock and startup continues.

If numSubordinates is greater than the number of entries the monitor
sees, startup hangs. This can occur e.g. when there are replication
conflict entries in the subtree (these do not appear in search
results, but still affect numSubordinates).

Handle this scenario by setting a watchdog timer and forcibly
unlocking the startup lock if it expires. This also gives us the
opportunity to log a warning that there may be conflict entries or
other extraneous entries in the subtree.

Fixes: https://pagure.io/dogtagpki/issue/3078
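
The mechanism can be pictured with a small sketch. This is an illustrative
approximation under assumed names, not the actual AsyncLoader code: the
monitor thread counts processed entries against the expected numSubordinates
value, while a watchdog timer forcibly releases the startup lock if that
count is never reached (a timeout <= 0 disables the watchdog).

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;

/** Illustrative sketch only; names and structure are assumptions. */
class WatchdogLoader {

    private final CountDownLatch latch = new CountDownLatch(1);
    private final int timeoutSeconds;   // <= 0 disables the watchdog
    private int expected;               // e.g. the container's numSubordinates
    private int processed;
    private Timer watchdog;

    WatchdogLoader(int timeoutSeconds) {
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Called by the monitor thread before processing begins. */
    synchronized void startLoading(int expected) {
        this.expected = expected;
        if (timeoutSeconds > 0) {
            watchdog = new Timer("profile-load-watchdog", true /* daemon */);
            watchdog.schedule(new TimerTask() {
                public void run() {
                    // numSubordinates overstated the visible entries;
                    // unlock anyway so startup can continue.
                    release();
                }
            }, timeoutSeconds * 1000L);
        }
        if (expected <= 0) release();
    }

    /** Called by the monitor thread for each entry it processes. */
    synchronized void increment() {
        if (++processed >= expected) release();
    }

    private synchronized void release() {
        if (watchdog != null) watchdog.cancel();
        latch.countDown();   // unblock the startup thread
    }

    /** Called by the startup thread; blocks until released. */
    void awaitLoadDone() throws InterruptedException {
        latch.await();
    }
}
```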

Changes:

5d17fd015 (Fraser Tweedale, 7 days ago)
   Add watchdog timer for initial load of LWCAs

   Similar to the work done for LDAPProfileSubsystem, this avoids hanging startup
   when the number of entries processed during the initial load of LWCAs is less
   than suggested by the numSubordinates attribute of the container entry
   (replication conflict entries can cause this).  Switch the authority monitor
   to use AsyncLoader, which provides the watchdog timer and takes care of
   some of the existing logic.

   Also add a log message when the startup await gets interrupted, to indicate
   that there may be replication conflicts or other extraneous data in the
   LWCA subtree.

   Related: https://pagure.io/dogtagpki/issue/3078
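
Reusing the hypothetical WatchdogLoader sketch above, the division of labour
between a monitor thread and the startup thread might be wired roughly as
follows; the class, method, and variable names are illustrative only, not the
actual authority monitor code.

```java
import java.util.List;

/** Hypothetical wiring only; reuses the WatchdogLoader sketch above. */
class AuthorityMonitorExample {

    private final WatchdogLoader loader = new WatchdogLoader(10); // 10 s watchdog

    /** Monitor thread: process whatever entries the search actually returns. */
    void monitor(int numSubordinates, List<String> visibleEntryDNs) {
        loader.startLoading(numSubordinates);
        for (String dn : visibleEntryDNs) {
            System.out.println("loading LWCA entry " + dn);  // placeholder
            loader.increment();
        }
        // If numSubordinates was inflated (e.g. by conflict entries), fewer
        // increments happen than expected and the watchdog releases the
        // startup lock after 10 seconds instead.
    }

    /** Startup thread: block until the initial load completes or times out. */
    void startup() throws InterruptedException {
        loader.awaitLoadDone();
        // ... continue server startup ...
    }
}
```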

7b9db95ed (Fraser Tweedale, 5 months ago)
   LDAPProfileSubsystem: add watchdog timer for initial load

   During initial profile loading, if we receive fewer entries than indicated
   by the parent entry's numSubordinates attribute, the AsyncLoader will not
   unlock, and the Dogtag startup thread is blocked.  This situation can arise
   when there are entries that contribute to the numSubordinates count but are
   not visible to Dogtag.  Replication conflict entries are one such example.

   The implementation currently uses a persistent search that also returns
   existing entries.  The alternative approach - a regular search followed by
   a persistent search - leaves open the possibility of missing replicated
   changes to the subtree that were processed in between the regular and
   persistent search.  Therefore we use a single search, which avoids this
   possibility.

   We also *do* want to block startup until all profiles are loaded. The
   system reporting ready before profiles are loaded has led to issues in CI
   and production environments.  During a persistent search, there is no
   in-band signal that indicates when all the
   "immediate" results have been delivered.  The solution was to read the
   numSubordinates value of the container to know how many immediate results
   to process.  So we have to address the corner cases discussed above.

   The approach to resolving this is to use a watchdog timer during initial
   load of profiles.  The AsyncLoader is now initialised with a timeout value
   (in seconds).  A timer is started and the lock is forcibly released after
   the timeout.  A value <= 0 suppresses the watchdog.  Update the
   LDAPProfileSubsystem to time out the loader after 10 seconds.  The existing
   behaviour of unlocking when the expected number of entries has been
   processed is maintained.

   Also add a log message when the startup await gets interrupted, to indicate
   that there may be replication conflicts or other extraneous data in the
   profile configuration subtree.

   Fixes: https://pagure.io/dogtagpki/issue/3078
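
The single-persistent-search approach described here might look roughly like
the sketch below, written against the Mozilla LDAP Java SDK (netscape.ldap)
that Dogtag uses for LDAP access. It reads numSubordinates from the container
entry, issues one persistent search with changesOnly=false so that existing
entries are returned before subsequent changes, and counts results against
the hypothetical WatchdogLoader sketched earlier. The surrounding class and
wiring are assumptions for illustration, not the actual LDAPProfileSubsystem
code.

```java
import netscape.ldap.LDAPAttribute;
import netscape.ldap.LDAPConnection;
import netscape.ldap.LDAPEntry;
import netscape.ldap.LDAPSearchConstraints;
import netscape.ldap.LDAPSearchResults;
import netscape.ldap.LDAPv2;
import netscape.ldap.controls.LDAPPersistSearchControl;

class ProfileLoadSketch {

    void load(LDAPConnection conn, String containerDN, WatchdogLoader loader)
            throws Exception {

        // Read numSubordinates from the container to learn how many
        // "immediate" results to expect from the persistent search.
        LDAPEntry container =
                conn.read(containerDN, new String[] { "numSubordinates" });
        LDAPAttribute numSub = container.getAttribute("numSubordinates");
        int expected = (numSub == null)
                ? 0 : Integer.parseInt(numSub.getStringValueArray()[0]);
        loader.startLoading(expected);

        // One persistent search with changesOnly=false returns the existing
        // entries first and then subsequent changes, so there is no window
        // between an initial search and a follow-up persistent search.
        LDAPPersistSearchControl psc = new LDAPPersistSearchControl(
                LDAPPersistSearchControl.ADD
                        | LDAPPersistSearchControl.MODIFY
                        | LDAPPersistSearchControl.DELETE
                        | LDAPPersistSearchControl.MODDN,
                false /* changesOnly: also return existing entries */,
                true  /* return entry change controls */,
                true  /* critical */);
        LDAPSearchConstraints cons = new LDAPSearchConstraints();
        cons.setServerControls(psc);
        cons.setBatchSize(1);   // deliver results as they arrive

        LDAPSearchResults results = conn.search(
                containerDN, LDAPv2.SCOPE_ONE, "(objectClass=*)",
                null /* all attributes */, false, cons);

        while (results.hasMoreElements()) {
            LDAPEntry entry = results.next();
            System.out.println("processed " + entry.getDN()); // placeholder
            loader.increment();  // unlocks startup once 'expected' entries seen
        }
    }
}
```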

@cipherboy cipherboy (Member) left a comment

I'm definitely not the best person to review this... The one thought I had was whether a hard 10-second timeout is appropriate here.

I think the answer is yes, because at this point the LDAP should already be confirmed to be up and working. We're trying to load a number of entries, so even if the connection is really slow (e.g., huge LDAP db?), that still shouldn't hit our timeout limit. Making it configurable might help in some rare instances, but I'm not sure if we want to (or even if we have a reference to the configuration here).

Thoughts?

@edewata (Contributor) commented Apr 18, 2019

The code around this area is a bit complicated, so I'll review based on the description. I trust you about the code :)

Would it be safe to resume PKI server startup (after timeout) knowing that there's a conflict in the DS?

When a DS replication conflict happens (which will manifest as incorrect numSubordinates), is it something that DS will be able to detect and resolve automatically, or does it require manual intervention?

If DS can resolve it automatically, would the startup code become unblocked eventually (so no need for timeout)? Or is there a mechanism for DS to notify PKI server about the conflict resolution so it can interrupt the current persistent search and start a new one?

If it requires manual intervention, should PKI server actually terminate such that it doesn't provide incorrect information and/or cause further damage?

@frasertweedale (Contributor, Author)

@edewata good questions :) I will answer one at a time.

Would it be safe to resume PKI server startup (after timeout) knowing that there's a conflict in the DS?

We don't know for sure that this is due to conflict entries - it could be caused by other entries we haven't seen. But conflicts are the most common case. I would say it's OK to proceed, because nothing else currently detects conflict entries and halts on them; i.e. it is sufficient to warn that there may be a conflict, then move on. Other projects (particularly IPA Health Check) are going to be responsible for detecting and alerting about DS conflicts.

When a DS replication conflict happens (which will manifest as incorrect numSubordinates), is it something that DS will be able to detect and resolve automatically, or does it require manual intervention?

AFAIK DS does not have automatic remediation.

If DS can resolve it automatically, would the startup code become unblocked eventually (so no need for timeout)? Or is there a mechanism for DS to notify PKI server about the conflict resolution so it can interrupt the current persistent search and start a new one?

Not applicable.

If it requires manual intervention, should PKI server actually terminate such that it doesn't provide incorrect information and/or cause further damage?

We are not detecting conflict entries in general, only in one subtree (and, again, it might not actually be a conflict entry at all - the whole point about inflated numSubordinates is that we can't see the entries that inflated it!). My take is that the correct behaviour is to continue startup and let other tools handle detection and remediation.

@edewata (Contributor) commented Apr 29, 2019

OK, just a note that standalone PKI installations will not have IPA Health Check to detect such an issue. Regardless, DS replication conflicts are outside the scope of PKI, so feel free to check it in if you think the risk is manageable.

@frasertweedale (Contributor, Author)

OK, just a note that standalone PKI installations will not have IPA Health Check to detect such an issue. Regardless, DS replication conflicts are outside the scope of PKI, so feel free to check it in if you think the risk is manageable.

Right, I meant to mention earlier but forgot: standalone PKI installations use file-based profiles by default, and LWCAs are only supported in an IPA context, so yeah, I think it is low-risk :) Thanks for the review @edewata.
