LinkLayer Race Condition #138
Comments
Interesting. Thanks for the report and unit test, Alan. I'll have a look at fixing this when time permits.
I think we'll just have to remove the "transmitting" states altogether and transition directly into a waiting-for-response state. We can then have a flag for "transmitting" on the Context that must be cleared before any further actions are taken.
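To make the proposal concrete, here is a minimal sketch of a primary state machine with no "transmitting" state and a flag on the context instead. All names (`PriState`, `Context`, `SendRequestLinkStatus`, `OnLinkStatus`) are hypothetical stand-ins for illustration, not the library's actual types.

```cpp
#include <cassert>

// Hypothetical sketch: these are not the library's real types.
enum class PriState { Idle, WaitLinkStatus };

struct Context {
    PriState pri = PriState::Idle;
    bool transmitting = false;  // set while an asynchronous write is outstanding
    int acceptedResponses = 0;

    // Start a request: move straight to the waiting state and raise the flag.
    bool SendRequestLinkStatus() {
        if (transmitting || pri != PriState::Idle) return false;
        transmitting = true;
        pri = PriState::WaitLinkStatus;
        return true;
    }

    // Write completion only clears the flag; a failed write abandons the wait.
    void OnTransmitResult(bool success) {
        assert(transmitting);
        transmitting = false;
        if (!success) pri = PriState::Idle;
    }

    // A response is accepted even if the write completion has not run yet.
    bool OnLinkStatus() {
        if (pri != PriState::WaitLinkStatus) return false;
        pri = PriState::Idle;
        ++acceptedResponses;
        return true;
    }
};
```

With this shape, a response that overtakes the write completion is still accepted, and the flag simply blocks the next transmission until the completion arrives.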
Agree, it'd be elegant if the states aligned with those in the standard.
I have a plan now to fix this and greatly simplify much of the stack's internals. I plan to change the internal interfaces between layers so that higher layers only consume data when they're "ready", i.e. when they're not transmitting. We'll let the OS buffer data and throttle things. In retrospect, this is the way it should have been written originally.
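One way to picture the "consume only when ready" idea is the sketch below. It is an assumption about the general shape, not the planned interface: `UpperLayer`, `LowerLayer`, and `Pump` are hypothetical names, and the `pending` queue stands in for the buffering the OS would provide.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <utility>
#include <vector>

// Hypothetical upper layer: refuses new data while a transmit is outstanding.
struct UpperLayer {
    bool transmitting = false;
    std::vector<std::string> handled;

    bool Ready() const { return !transmitting; }
    void Handle(const std::string& frame) { handled.push_back(frame); }
};

// Hypothetical lower layer: holds received frames until the upper layer is ready.
struct LowerLayer {
    std::deque<std::string> pending;  // stands in for the OS receive buffer

    void OnData(std::string frame) { pending.push_back(std::move(frame)); }

    // Deliver buffered frames only while the upper layer reports ready.
    std::size_t Pump(UpperLayer& upper) {
        std::size_t delivered = 0;
        while (!pending.empty() && upper.Ready()) {
            upper.Handle(pending.front());
            pending.pop_front();
            ++delivered;
        }
        return delivered;
    }
};
```

A frame that arrives mid-transmit stays queued rather than being processed in the wrong state, and is delivered on the next `Pump` call after the transmit completes.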
I have found a similar issue in my testing, one that would cause a crash. OnTransmitResult (from OnWriteCallback) was being received before TrySendUnconfirmed had returned, causing incorrect state transitions for pPriState. I am not sure whether you are still reworking this code, but I have resolved the issue I was having with the following mutex: In LinkContext::TryStartTransmission() around the code: And in LinkContext::OnTransmitResult(bool success) around the code: I would assume that TrySendRequestLinkStatus (a few lines earlier in TryStartTransmission) will also need the same mutex protection as the code is currently written. I would recommend evaluating every place where pPriState/pSecState is modified to make sure it is thread-safe.
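A minimal sketch of the locking pattern this comment describes, assuming the write-completion callback arrives on a different thread from the caller. The original snippets are not shown above, so `GuardedLink`, its members, and the `startWrite` parameter are hypothetical stand-ins rather than the library's code.

```cpp
#include <functional>
#include <mutex>

// Hypothetical states; the real stack has more of them.
enum class PriState { Idle, TransmitWait, ResponseWait };

class GuardedLink {
public:
    // The write is started and the new state is assigned under one lock, so a
    // completion callback on another thread cannot observe the old state.
    bool TryStartTransmission(const std::function<void()>& startWrite) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_ != PriState::Idle) return false;
        startWrite();
        state_ = PriState::TransmitWait;
        return true;
    }

    // Blocks until TryStartTransmission has finished its state assignment.
    void OnTransmitResult(bool success) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (state_ != PriState::TransmitWait) return;
        state_ = success ? PriState::ResponseWait : PriState::Idle;
    }

    PriState State() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return state_;
    }

private:
    mutable std::mutex mutex_;
    PriState state_ = PriState::Idle;
};
```

One caveat with this approach: if `startWrite` ever invoked `OnTransmitResult` synchronously on the calling thread, re-locking a plain `std::mutex` would be undefined behavior (typically a deadlock), so the sketch only fits the cross-thread case.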
@cbye thanks. I intend to refactor how this works when time permits.
@emgre This will be resolved by just removing support for link layer confirms, which isn't required by the standard or useful in practice anyway.
Data-link confirmation support was removed, so the primary state machine is greatly simplified.
We've noticed that on occasion we see the link layer error message:
In hunting the cause down we worked out that this was due to the extremely unlikely event that a response to a link layer message is received before the sender has completed the transmit and transitioned out of the "TransmitWait" state. This is possible despite the use of strands because link layer transmit is split into multiple tasks. Normally, the ASIO task order would be something like:
When this race condition occurs the following order is realised:
This situation wouldn't generally be seen in most real-world deployments, where network latency ensures that the transmit completes before a response is received. However, in extremely low latency situations (or where the transmitter is starved of compute resources) it is possible. The impact of this issue is minor, as it results in a timeout and retransmission of the link message.
Following is a unit test that demonstrates the issue. Although RequestLinkStatus is used here, the issue could conceivably affect any PRI wait state. The SEC_LINK_STATUS is thrown away because the PRI state machine is still in transmit wait despite the transmit having completed. The PRI state machine then times out (in this case waiting for a link status).
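The original unit test is not reproduced here. As a stand-alone illustration of the same ordering, the sketch below replays events against a toy state machine; `ToyLink`, `Event`, and `Replay` are hypothetical names and this is not the stack's test harness.

```cpp
#include <string>
#include <vector>

enum class Pri { Idle, TransmitWait, ResponseWait };
enum class Event { TransmitDone, LinkStatus, Timeout };

// Toy primary state machine with a separate transmit-wait state.
struct ToyLink {
    Pri state = Pri::Idle;
    std::vector<std::string> log;

    void SendRequestLinkStatus() {
        state = Pri::TransmitWait;
        log.push_back("tx started");
    }
    void OnTransmitResult() {
        if (state != Pri::TransmitWait) return;
        state = Pri::ResponseWait;
        log.push_back("tx complete");
    }
    void OnLinkStatus() {
        if (state == Pri::ResponseWait) {
            state = Pri::Idle;
            log.push_back("response accepted");
        } else {
            log.push_back("response discarded");  // arrived while still in TransmitWait
        }
    }
    void OnTimeout() {
        if (state != Pri::ResponseWait) return;
        state = Pri::Idle;
        log.push_back("timeout");
    }
};

// Send one request, then apply the events in the given order.
std::vector<std::string> Replay(const std::vector<Event>& order) {
    ToyLink link;
    link.SendRequestLinkStatus();
    for (Event e : order) {
        switch (e) {
            case Event::TransmitDone: link.OnTransmitResult(); break;
            case Event::LinkStatus:   link.OnLinkStatus();     break;
            case Event::Timeout:      link.OnTimeout();        break;
        }
    }
    return link.log;
}
```

Replaying `{TransmitDone, LinkStatus, Timeout}` ends with the response accepted, whereas `{LinkStatus, TransmitDone, Timeout}` discards the response and ends in a timeout, which matches the behaviour described in the report.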