Skip to content

Commit

Permalink
srmmanager: fix race condition in LoginBrokerSubscriber
Browse files Browse the repository at this point in the history
Motivation:

The `LoginBrokerSubscriber` class is responsible for maintaining a list
of doors.  It is currently used by the `srmmanager`, `frontend` and
`httpd` services.

Instances of `LoginBrokerSubscriber` receive updated information from
doors in which they provide `LoginBrokerInfo` objects.  These are
immutable objects that describe their current status.

`LoginBrokerSubscriber` maintains three distinct mappings that hold the
`LoginBrokerInfo` objects.  The `doorsByIdentity` map holds all
`LoginBrokerInfo` objects.  There are also two by-protocol mappings (one
for doors supporting writing/uploading; the other for doors supporting
reading/downloading).  In addition, there is an ordered queue to support
automatic removal of `LoginBrokerInfo` objects if no update is heard
from a door for "a while".

Only `srmmanager` uses the two by-protocol maps.  The `frontend` and
`httpd` services use only the `doorsByIdentity` mapping.

The `LoginBrokerSubscriber` code assumes that any `LoginBrokerInfo`
object that exists in either of the two by-protocol mappings also exists
in the `doorsByIdentity` mapping.  However, when accepting information
from a door, the `LoginBrokerInfo` object is first added to the
`doorsByIdentity` mapping before it is added to the by-protocol
mappings.  This violates that assumption for "a short while".

Note that the `srmmanager` cell can process incoming messages with
multiple threads and that some doors may send multiple `LoginBrokerInfo`
objects in rapid succession, particularly on start-up.  Also
`DelayQueue` uses locking to prevent concurrent updates.

In the current code, when receiving a `LoginBrokerInfo` object, the
object is added to the `doorsByIdentity` (retaining any existing object
so it may be removed), then added to the expiry queue, then the
`ByProtocolMap` mappings are updated.

Under `DelayQueue` lock contention and when `door-1` sends two
`LoginBrokerInfo` objects (`info-A` and `info-B`) in rapid succession,
the following may happen:

    Thread-A puts LoginBrokerInfo info-A (from door-1) into
    doorsByIdentity.

    Thread-A attempts to add info-A into DelayQueue; this blocks.

    Thread-B puts LoginBrokerInfo info-B (from door-1) into
    doorsByIdentity.  The put call receives info-A: stale info that
    should be removed.

    Thread-B attempts to add info-B into DelayQueue; this blocks.

    Thread-B's call to DelayQueue#add is processed.

    Thread-B adds info-B to the two ByProtocolMap mappings.

    Thread-B attempts to remove info-A from the two ByProtocolMap
    mappings.  This does nothing because Thread-A hasn't yet added
    info-A.

    Thread-A's call to DelayQueue#add is processed.

    Thread-A adds info-A to the two ByProtocolMap mappings.

The result is that `info-A` exists in the two `ByProtocolMap` mappings
but not in the `doorsByIdentity` mapping.  The `info-A` object does not
appear in `lb ls` admin command because it does not exist in
`doorsByIdentity`; however, `info-A` continues to affect TURL selection
as it exists in the `ByProtocolMap` mappings.

When it is time for `info-A` to expire (in the `DelayQueue`) the removal
fails because `info-A` does not exist in the `doorsByIdentity` mapping.
The code assumes that, if a `LoginBrokerInfo` object doesn't exist in
`doorsByIdentity` then it also doesn't exist in either of the two
`ByProtocolMap` mappings, so there's nothing to remove.

Modification:

Update order in which a `LoginBrokerInfo` object is added to the maps.
This guarantees that any object in either of the two `ByProtocolMap`
mappings also exists in `doorsByIdentity`.

Add a safety feature where the code will attempt to remove an expired
`LoginBrokerInfo` objects from the two `ByProtocolMap` mappings even if
it doesn't exist in `doorsByIdentity`.  Warn if this actually does
something.

Result:

A race condition is fixed that, if triggered, results in a memory leak.
This leak can also affect the TURLs returned by SrmManager, where
out-of-date information about doors is used.

Target: master
Requires-notes: yes
Requires-book: no
Request: 7.1
Request: 7.0
Request: 6.2
Request: 6.1
Request: 6.0
Request: 5.2
Closes: #5972
Patch: https://rb.dcache.org/r/13113/
Acked-by: Lea Morschel
  • Loading branch information
paulmillar committed Aug 25, 2021
1 parent 9863ed9 commit 2ed4135
Showing 1 changed file with 40 additions and 1 deletion.
Expand Up @@ -19,6 +19,8 @@

import com.google.common.collect.Maps;
import com.google.common.util.concurrent.MoreExecutors;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.Collection;
import java.util.Collections;
Expand Down Expand Up @@ -68,6 +70,8 @@
public class LoginBrokerSubscriber
implements CellCommandListener, CellMessageReceiver, LoginBrokerSource, CellMessageSender, CellEventListener, CellLifeCycleAware
{
private static final Logger LOGGER = LoggerFactory.getLogger(LoginBrokerSubscriber.class);

public static final double EXPIRATION_FACTOR = 2.5;

/**
Expand Down Expand Up @@ -188,9 +192,19 @@ private synchronized void requestUpdate()
private void add(Entry entry)
{
if (tags.isEmpty() || !Collections.disjoint(tags, entry.getLoginBrokerInfo().getTags())) {

/* The order here is important! The LoginBrokerInfo object must be
* added to the ByProtocolMap listings (where appropriate) before
* being added to doorsByIdentity. This is to guarantee that any
* LoginBrokerInfo object in doorsByIdentity has any corresponding
* entries in the ByProtocolMap listings. Without this guarantee,
* the code may leak doors. The result would be "phantom"
* LoginBrokerInfo objects: ones in the ByProtocolMap listings but
* not in doorsByIdentity.
*/
addByProtocol(entry.info);
Entry old = doorsByIdentity.put(entry.info.getIdentifier(), entry);
queue.add(entry);
addByProtocol(entry.info);
if (old != null) {
removeByProtocol(old.info);
}
Expand All @@ -202,6 +216,14 @@ private void remove(Entry entry)
LoginBrokerInfo info = entry.getLoginBrokerInfo();
if (doorsByIdentity.remove(info.getIdentifier(), entry)) {
removeByProtocol(info);
} else {
/* Note that the code in this else-block should not be needed. It
* is included to catch any "phantom doors": LoginBrokerInfo objects
* that have already been removed from doorsByIdentifier (e.g.,
* door has sent an update) but are still present in either
* ByProtocolMap.
*/
ensureRemovedByProtocol(info);
}
}

Expand All @@ -217,6 +239,12 @@ private void removeByProtocol(LoginBrokerInfo info)
info.ifCapableOf(WRITE, writeDoors::remove);
}

private void ensureRemovedByProtocol(LoginBrokerInfo info)
{
info.ifCapableOf(READ, readDoors::ensureRemoved);
info.ifCapableOf(WRITE, writeDoors::ensureRemoved);
}

@Override
public Collection<LoginBrokerInfo> doors()
{
Expand Down Expand Up @@ -353,6 +381,17 @@ public boolean remove(LoginBrokerInfo info)
return get(info.getProtocolFamily()).remove(info);
}

/**
* We expect that the LoginBrokerInfo is absent, but we wish to ensure
* that it is missing.
*/
public void ensureRemoved(LoginBrokerInfo info)
{
if (remove(info)) {
LOGGER.warn("Removing phantom door {}", info);
}
}

public Set<LoginBrokerInfo> get(String protocol)
{
return doors.computeIfAbsent(protocol, key -> Collections.newSetFromMap(new ConcurrentHashMap<>()));
Expand Down

0 comments on commit 2ed4135

Please sign in to comment.