Skip to content
This repository has been archived by the owner on Sep 1, 2022. It is now read-only.

LinkLayer Race Condition #138

Closed
atmurray opened this issue Dec 15, 2015 · 8 comments
Closed

LinkLayer Race Condition #138

atmurray opened this issue Dec 15, 2015 · 8 comments
Labels
Milestone

Comments

@atmurray
Copy link
Contributor

We've noticed that on occasion we see the link layer error message:

Frame context not understood

In hunting the cause down we worked out that this was due to the extremely unlikely event that a response to a link layer message is received prior to the sender completing and transitioning from the "TransmitWait" state. This is made possible despite the use of strads because link layer transmit is split into multiple tasks. Normally, the ASIO task order would be something like:

  1. Some transmit is queued which transitions the PRI state machine to Transmit Wait state
  2. Buffer is sent and queues a task to transition the PRI state machine
  3. PRI state machine transitions to the Wait state
  4. Response is correctly received and queues a task to transition the PRI state machine
  5. PRI state machine transitions to Idle state

When this Race Condition occurs the following order is realised:

  1. Some transmit is queued which transitions the PRI state machine to Transmit Wait state
  2. Buffer is sent and queues a task to transition the PRI state machine
  3. Response is unexpectedly received and discarded
  4. PRI state machine transitions to the Wait state
  5. Timeout occurs in Wait state

This situation wouldn't generally be seen in most real world situations whereby network latency would ensure the completion of the transmit prior to receiving a response. However, in extremely low latency situations (or where the transmitter is starved of compute resources) it is possible. The impact of this issue is minor as it results in a timeout and retransmission of the link message.

Following is a unit test that demonstrates the issue, whilst RequestLinkStatus is used, this issue could conceivably impact any PRI wait state. The SEC_LINK_STATUS is thrown away as the PRI state machine is still in transmit wait despite having completed the transmit. The PRI state machine then times out (in this case waiting for a link status).

TEST_CASE(SUITE("KeepAliveSuccessCallbackIsInvokedWhenLinkStatusReceivedBeforeTransmitComplete"))
{
    LinkConfig config(true, false);
    config.KeepAliveTimeout = TimeDuration::Seconds(5);
    LinkLayerTest t(config);

    t.link.OnLowerLayerUp();

    REQUIRE(t.exe.NumPendingTimers() == 1);
    REQUIRE(t.listener.numKeepAliveTransmissions == 0);

    REQUIRE(t.exe.AdvanceToNextTimer());
    REQUIRE(t.exe.RunMany() > 0);

    REQUIRE(t.PopLastWriteAsHex() == LinkHex::RequestLinkStatus(true, 1024, 1));
    t.OnFrame(LinkFunction::SEC_LINK_STATUS, false, false, false, 1, 1024);
    t.link.OnTransmitResult(true);
    REQUIRE(t.listener.numKeepAliveReplys == 1);
    REQUIRE(t.exe.NumPendingTimers() == 1);
}
@jadamcrain
Copy link
Member

Interesting. Thanks for the report and unit test Alan. I'll have a look at fixing when time permits.

@jadamcrain
Copy link
Member

I think we'll just have to remove the "transmitting" states altogether and transition directly into a waiting for response state. We can then have a flag for "transmitting" on the Context that must be cleared before any further actions are taken.

@atmurray
Copy link
Contributor Author

Agree, it'd be elegant if the states aligned with those in the standard.
Presumably the only situation that would cause conforming implementations to receive an unexpected function code would be a sufficiently delayed ACK or NACK being received after transitioning to one of the Wait states. This would be symptomatic of timeouts being set too small for a given link (or degraded link performance).

@jadamcrain jadamcrain added the bug label Oct 2, 2016
@jadamcrain
Copy link
Member

I have a plan now to fix this and greatly simplify lots of internal to the stack. I plan to change the internal interfaces between layers to allow higher layers to only consume data when they're "ready", i.e. when they're not transmitting. We'll let the OS buffer data and throttle things. In retrospect, this is the way it should have been written originally. I think it will greatly simplify a lot of things.

@cbye
Copy link

cbye commented Apr 5, 2017

I have found a similar issue in my testing (which would cause a crash). I was having issues where OnTransmitResult (from OnWriteCallback) was received before returning out of TrySendUnconfirmed causing incorrect state transitions for pPriState. I am not sure if you are still reworking this code, but I have resolved the issue that I am having with the following mutex:

In LinkContext::TryStartTransmission() around the code:
transmitMutex.lock();
if (this->pSegments)
{
this->pPriState = (this->config.UseConfirms) ? &pPriState->TrySendConfirmed(*this, *pSegments) : &pPriState->TrySendUnconfirmed(*this, *pSegments);
}
transmitMutex.unlock();

And in LinkContext::OnTransmitResult(bool success) around the code:
transmitMutex.lock();
if (isPrimary)
{
this->pPriState = &this->pPriState->OnTransmitResult(*this, success);
}
else
{
this->pSecState = &this->pSecState->OnTransmitResult(*this, success);
}
transmitMutex.unlock();

I would assume the TrySendRequestLinkStatus (a few lines before in TryStartTransmission) will also need the same mutex protection as the code currently is written. Would recommend evaluating anywhere pPriState/pSecState is modified to make sure they are thread safe.

@jadamcrain
Copy link
Member

@cbye thanks. I intend to refactor how this works when time permits.

@jadamcrain jadamcrain added this to the 3.0.0 milestone May 24, 2019
@jadamcrain
Copy link
Member

@emgre This will be resolved by just removing support for link layer confirms which isn't required by the standard or useful in practice anyway.

emgre added a commit that referenced this issue Apr 22, 2020
emgre added a commit that referenced this issue Apr 23, 2020
@emgre
Copy link
Member

emgre commented Apr 27, 2020

Data-link confirmation support was removed, so the primary state machine is greatly simplified. The RequestLinkStatus was also refactored an this issue should be fixed.

@emgre emgre closed this as completed Apr 27, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

4 participants