Handle inflated numSubordinates in profile & LWCA subtrees #188

Conversation

frasertweedale (Contributor)

The LDAP profile and LWCA monitor threads use the numSubordinates
attribute of the container object to determine how many entries it
expects to process. After processing that many entries, the monitor
thread releases the startup lock and startup continues.

If numSubordinates is greater than the number of entries the monitor
sees, startup hangs. This can occur e.g. when there are replication
conflict entries in the subtree (these do not appear in search
results, but still affect numSubordinates).

Handle this scenario by setting a watchdog timer and forcibly
unlocking the startup lock if it expires. This also gives us the
opportunity to log a warning that there may be conflict entries or
other extraneous entries in the subtree.

Fixes: https://pagure.io/dogtagpki/issue/3078
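
The mechanism can be pictured with a small sketch. This is an illustrative
approximation under assumed names, not the actual AsyncLoader code: the
monitor thread counts processed entries against the expected numSubordinates
value, while a watchdog timer forcibly releases the startup lock if that
count is never reached (a timeout <= 0 disables the watchdog).

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;

/** Illustrative sketch only; names and structure are assumptions. */
class WatchdogLoader {

    private final CountDownLatch latch = new CountDownLatch(1);
    private final int timeoutSeconds;   // <= 0 disables the watchdog
    private int expected;               // e.g. the container's numSubordinates
    private int processed;
    private Timer watchdog;

    WatchdogLoader(int timeoutSeconds) {
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Called by the monitor thread before processing begins. */
    synchronized void startLoading(int expected) {
        this.expected = expected;
        if (timeoutSeconds > 0) {
            watchdog = new Timer("profile-load-watchdog", true /* daemon */);
            watchdog.schedule(new TimerTask() {
                public void run() {
                    // numSubordinates overstated the visible entries;
                    // unlock anyway so startup can continue.
                    release();
                }
            }, timeoutSeconds * 1000L);
        }
        if (expected <= 0) release();
    }

    /** Called by the monitor thread for each entry it processes. */
    synchronized void increment() {
        if (++processed >= expected) release();
    }

    private synchronized void release() {
        if (watchdog != null) watchdog.cancel();
        latch.countDown();   // unblock the startup thread
    }

    /** Called by the startup thread; blocks until released. */
    void awaitLoadDone() throws InterruptedException {
        latch.await();
    }
}
```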

Changes:

5d17fd015 (Fraser Tweedale, 7 days ago)
   Add watchdog timer for initial load of LWCAs

   Similar to the work done for LDAPProfileSubsystem, this avoids hanging startup
   when the number of entries processed during the initial load of LWCAs is less
   than suggested by the numSubordinates attribute of the container entry
   (replication conflict entries can cause this).  Switch the authority monitor
   to use AsyncLoader, which provides the watchdog timer and takes care of
   some of the existing logic.

   Also add a log message when the startup await gets interrupted, to indicate
   that there may be replication conflicts or other extraneous data in the
   LWCA subtree.

   Related: https://pagure.io/dogtagpki/issue/3078
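
Reusing the hypothetical WatchdogLoader sketch above, the division of labour
between a monitor thread and the startup thread might be wired roughly as
follows; the class, method, and variable names are illustrative only, not the
actual authority monitor code.

```java
import java.util.List;

/** Hypothetical wiring only; reuses the WatchdogLoader sketch above. */
class AuthorityMonitorExample {

    private final WatchdogLoader loader = new WatchdogLoader(10); // 10 s watchdog

    /** Monitor thread: process whatever entries the search actually returns. */
    void monitor(int numSubordinates, List<String> visibleEntryDNs) {
        loader.startLoading(numSubordinates);
        for (String dn : visibleEntryDNs) {
            System.out.println("loading LWCA entry " + dn);  // placeholder
            loader.increment();
        }
        // If numSubordinates was inflated (e.g. by conflict entries), fewer
        // increments happen than expected and the watchdog releases the
        // startup lock after 10 seconds instead.
    }

    /** Startup thread: block until the initial load completes or times out. */
    void startup() throws InterruptedException {
        loader.awaitLoadDone();
        // ... continue server startup ...
    }
}
```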

7b9db95ed (Fraser Tweedale, 5 months ago)
   LDAPProfileSubsystem: add watchdog timer for initial load

   During initial profile loading, if we receive fewer entries than indicated
   by the parent entry's numSubordinates attribute, the AsyncLoader will not
   unlock, and the Dogtag startup thread is blocked.  This situation can arise
   when there are entries that contribute to the numSubordinates count but are
   not visible to Dogtag.  Replication conflict entries are one such example.

   The implementation currently uses a persistent search that also returns
   existing entries.  The alternative approach - a regular search followed by
   a persistent search - leaves open the possibility of missing replicated
   changes to the subtree that were processed in between the regular and
   persistent search.  Therefore we use a single search, which avoids this
   possibility.

   We also *do* want to block startup until all profiles are loaded. The
   system reporting ready before profiles are loaded has led to issues in CI
   and production environments.  During a persistent search, there is no
   in-band signal that indicates when all the
   "immediate" results have been delivered.  The solution was to read the
   numSubordinates value of the container to know how many immediate results
   to process.  So we have to address the corner cases discussed above.

   The approach to resolving this is to use a watchdog timer during initial
   load of profiles.  The AsyncLoader is now initialised with a timeout value
   (in seconds).  A timer is started and the lock is forcibly released after
   the timeout.  A value <= 0 suppresses the watchdog.  Update the
   LDAPProfileSubsystem to time out the loader after 10 seconds.  The existing
   behaviour of unlocking when the expected number of entries has been
   processed is maintained.

   Also add a log message when the startup await gets interrupted, to indicate
   that there may be replication conflicts or other extraneous data in the
   profile configuration subtree.

   Fixes: https://pagure.io/dogtagpki/issue/3078
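
The single-persistent-search approach described here might look roughly like
the sketch below, written against the Mozilla LDAP Java SDK (netscape.ldap)
that Dogtag uses for LDAP access. It reads numSubordinates from the container
entry, issues one persistent search with changesOnly=false so that existing
entries are returned before subsequent changes, and counts results against
the hypothetical WatchdogLoader sketched earlier. The surrounding class and
wiring are assumptions for illustration, not the actual LDAPProfileSubsystem
code.

```java
import netscape.ldap.LDAPAttribute;
import netscape.ldap.LDAPConnection;
import netscape.ldap.LDAPEntry;
import netscape.ldap.LDAPSearchConstraints;
import netscape.ldap.LDAPSearchResults;
import netscape.ldap.LDAPv2;
import netscape.ldap.controls.LDAPPersistSearchControl;

class ProfileLoadSketch {

    void load(LDAPConnection conn, String containerDN, WatchdogLoader loader)
            throws Exception {

        // Read numSubordinates from the container to learn how many
        // "immediate" results to expect from the persistent search.
        LDAPEntry container =
                conn.read(containerDN, new String[] { "numSubordinates" });
        LDAPAttribute numSub = container.getAttribute("numSubordinates");
        int expected = (numSub == null)
                ? 0 : Integer.parseInt(numSub.getStringValueArray()[0]);
        loader.startLoading(expected);

        // One persistent search with changesOnly=false returns the existing
        // entries first and then subsequent changes, so there is no window
        // between an initial search and a follow-up persistent search.
        LDAPPersistSearchControl psc = new LDAPPersistSearchControl(
                LDAPPersistSearchControl.ADD
                        | LDAPPersistSearchControl.MODIFY
                        | LDAPPersistSearchControl.DELETE
                        | LDAPPersistSearchControl.MODDN,
                false /* changesOnly: also return existing entries */,
                true  /* return entry change controls */,
                true  /* critical */);
        LDAPSearchConstraints cons = new LDAPSearchConstraints();
        cons.setServerControls(psc);
        cons.setBatchSize(1);   // deliver results as they arrive

        LDAPSearchResults results = conn.search(
                containerDN, LDAPv2.SCOPE_ONE, "(objectClass=*)",
                null /* all attributes */, false, cons);

        while (results.hasMoreElements()) {
            LDAPEntry entry = results.next();
            System.out.println("processed " + entry.getDN()); // placeholder
            loader.increment();  // unlocks startup once 'expected' entries seen
        }
    }
}
```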

@cipherboy cipherboy (Member) left a comment

I'm definitely not the best person to review this... The one thought I had was whether a hard 10-second timeout is appropriate here.

I think the answer is yes, because at this point the LDAP should already be confirmed to be up and working. We're trying to load a number of entries, so even if the connection is really slow (e.g., huge LDAP db?), that still shouldn't hit our timeout limit. Making it configurable might help in some rare instances, but I'm not sure if we want to (or even if we have a reference to the configuration here).

Thoughts?

@edewata (Contributor) commented Apr 18, 2019

The code around this area is a bit complicated, so I'll review based on the description. I trust you about the code :)

Would it be safe to resume PKI server startup (after timeout) knowing that there's a conflict in the DS?

When a DS replication conflict happens (which will manifest as incorrect numSubordinates), is it something that DS will be able to detect and resolve automatically, or does it require manual intervention?

If DS can resolve it automatically, would the startup code become unblocked eventually (so no need for timeout)? Or is there a mechanism for DS to notify PKI server about the conflict resolution so it can interrupt the current persistent search and start a new one?

If it requires manual intervention, should PKI server actually terminate such that it doesn't provide incorrect information and/or cause further damage?

@frasertweedale (Contributor, Author)

@edewata good questions :) I will answer one at a time.

Would it be safe to resume PKI server startup (after timeout) knowing that there's a conflict in the DS?

We don't know for sure that this is due to conflict entries - it could be caused by other entries we haven't seen. But conflicts are the most common case. I would say it's OK to proceed, because nothing else currently detects conflict entries and halts on them; i.e. it is sufficient to warn that there may be a conflict, then move on. Other projects (particularly IPA Health Check) are going to be responsible for detecting and alerting about DS conflicts.

When a DS replication conflict happens (which will manifest as incorrect numSubordinates), is it something that DS will be able to detect and resolve automatically, or does it require manual intervention?

AFAIK DS does not have automatic remediation.

If DS can resolve it automatically, would the startup code become unblocked eventually (so no need for timeout)? Or is there a mechanism for DS to notify PKI server about the conflict resolution so it can interrupt the current persistent search and start a new one?

Not applicable.

If it requires manual intervention, should PKI server actually terminate such that it doesn't provide incorrect information and/or cause further damage?

We are not detecting conflict entries in general, only in one subtree (and, again, it might not actually be a conflict entry at all - the whole point about inflated numSubordinates is that we can't see the entries that inflated it!). My take is that the correct behaviour is to continue startup and let other tools handle detection and remediation.

@edewata (Contributor) commented Apr 29, 2019

OK, just a note that standalone PKI installations will not have IPA Health Check to detect such an issue. Regardless, DS replication conflicts are outside the scope of PKI, so feel free to check it in if you think the risk is manageable.

@frasertweedale (Contributor, Author)

OK, just a note that standalone PKI installations will not have IPA Health Check to detect such an issue. Regardless, DS replication conflicts are outside the scope of PKI, so feel free to check it in if you think the risk is manageable.

Right, I meant to mention earlier but forgot: standalone PKI installations use file-based profiles by default, and LWCAs are only supported in an IPA context, so yeah, I think it is low-risk :) Thanks for the review @edewata.
