Implement Transport Layer Connection-Oriented State Machine Style 3 #119

Closed
dallmair opened this issue Apr 10, 2023 · 2 comments

Comments

@dallmair

This is a sort-of continuation of #96, but with sufficiently new information and a different stance to warrant its own issue.

To recap, we're using the point-to-point connection-oriented communication mode to update the firmware of DIY KNX devices over the bus, using calimero-core's ManagementClientImpl class. We encounter duplicate telegrams in the Transport Layer (repetitions of a telegram, i.e. with the same sequence number), and consequently the connection goes down.

How is that possible? As the firmware update requires transmission of several KiB of data, it takes a few minutes and causes a bus load of ~100% during that time. To avoid affecting normal bus operation (and thus the end-user experience) too much, all telegrams are sent with LOW priority. That opens up the problem of repetitions: in real-world installations, there's a lot that can go wrong. For example, one of our users just needs to flip a light switch to cause broken telegrams on his TP1 installation due to coupling. When this happens, the repetition of the LOW priority connection-oriented telegram sometimes collides with another telegram with HIGH priority. The transmission order is then: 1) original LOW priority telegram, 2) other HIGH priority telegram, 3) repeated LOW priority telegram. Consequently, the Data Link Layer can no longer filter out the repeated telegram (as a different telegram was transmitted in the meantime) and passes the repeated telegram on to the Transport Layer. The Transport Layer sees an "incorrect" (or rather, outdated) sequence number and shuts down the connection.
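To make the transmission order concrete, here is a minimal Python sketch of a receiver-side repetition filter. It is not calimero-core code and not taken from the KNX spec: the `Frame` type, the `passed_up` function, the addresses, and the rule "drop a repetition only if it matches the frame seen immediately before it" are all assumptions that mirror the description above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    priority: str           # "LOW" or "HIGH"; illustrative labels only
    source: str
    destination: str
    payload: str
    repeated: bool = False  # True when the sender marks the frame as a repetition


def same_telegram(a: Frame, b: Frame) -> bool:
    # Compare everything except the repetition marker.
    return replace(a, repeated=False) == replace(b, repeated=False)


def passed_up(frames_on_bus, own_address):
    """Return the frames a receiver's link layer hands to its upper layers.

    Assumption: a repetition is dropped only when it matches the frame seen
    immediately before it on the bus.
    """
    delivered = []
    previous = None
    for frame in frames_on_bus:
        duplicate = (
            frame.repeated
            and previous is not None
            and same_telegram(frame, previous)
        )
        if frame.destination == own_address and not duplicate:
            delivered.append(frame)
        previous = frame  # every frame on the bus updates the filter's memory
    return delivered


# Hypothetical addresses and payloads, chosen only for illustration.
original = Frame("LOW", "1.1.250", "1.1.10", "T_Data_Connected seq=3")
other = Frame("HIGH", "1.1.20", "0/0/1", "switch on")
repetition = replace(original, repeated=True)

back_to_back = passed_up([original, repetition], "1.1.10")
interleaved = passed_up([original, other, repetition], "1.1.10")
print(len(back_to_back), len(interleaved))
```

With the repetition directly after the original, only one frame reaches the upper layers. With the HIGH priority telegram in between, both copies get through, which is the situation the Transport Layer then has to deal with.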

To be fully clear about it: This is all perfectly aligned with the KNX spec and is the behavior that is specified in State Machine Style 1 Rationalized, Style 1, and Style 2 (not sure which one calimero-core implements).

Nevertheless, it has the very unfortunate effect that the connection goes down. And it is exactly the reason why State Machine Style 3 was specified [1]: That one is much more fault-tolerant and enables the communication partners to remain connected even when non-critical communication errors happen. Style 3 is mandatory for devices with BIM M112 and newer [2].
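The difference can be sketched as follows. This is a deliberately simplified illustration of the behavior as described in this issue, not a reproduction of the state tables in the KNX spec: the function name `receive`, the action labels, and the `tolerant` switch are hypothetical, and the handling of any other unexpected sequence number is left as a disconnect only to keep the sketch small. The one detail taken as given is that the sequence number is a 4-bit counter.

```python
def receive(seq_of_pdu: int, seq_no_rcv: int, tolerant: bool):
    """Decide what a receiver does with an incoming connected data telegram.

    seq_of_pdu -- sequence number carried by the telegram (0..15)
    seq_no_rcv -- sequence number the receiver expects next (0..15)
    tolerant   -- False: any mismatch tears the connection down;
                  True: a repetition of the previous telegram is acknowledged again

    Returns (action, new_seq_no_rcv).
    """
    if seq_of_pdu == seq_no_rcv:
        # Expected telegram: acknowledge, deliver, advance the 4-bit counter.
        return "ack_and_deliver", (seq_no_rcv + 1) & 0xF
    if tolerant and seq_of_pdu == ((seq_no_rcv - 1) & 0xF):
        # Repetition of the telegram that was already processed:
        # acknowledge again, do not deliver it twice, stay connected.
        return "ack_again", seq_no_rcv
    # Everything else ends the connection in this simplified sketch.
    return "disconnect", seq_no_rcv


# Receiver has already processed seq=3 and now expects seq=4;
# the repeated telegram with seq=3 arrives.
print(receive(3, 4, tolerant=False))
print(receive(3, 4, tolerant=True))
```

The strict variant drops the connection on the outdated sequence number, while the tolerant variant simply acknowledges it again and keeps expecting the next one.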

So the question is: Would you be willing to implement Style 3 in calimero-core?

[1] KNX spec v2.1 Chapter 3/3/4 Section 5.4.3, Pages 24ff.
[2] KNX spec v2.1 Chapter 6 Section 4.1.2, Page 31

@bmalinowsky
Collaborator

TP1 uses CSMA/CA, how would the partial frame after a detected collision manifest itself? The tx stops sending.

The relevant part for the TL stuff of vol. 6 "profiles" is chapter 9, mgmt client, not chapter 6. There (v2.1, sec. 9.2.3.3) it specifies TL-CO style 1. Style 3 would be for the server part, but that is not active when used as client.

@dallmair
Author

Many thanks for your fast response!

Well, I see my description above was pretty hand-wavy, sorry about that. Although it does not really matter due to your absolutely correct answer, I'd like to record what we're seeing so we don't need to go through this exercise again.

> TP1 uses CSMA/CA, how would the partial frame after a detected collision manifest itself? The tx stops sending.

In general, it depends on the timing of the spike/collision, of course. However, the IP interface in the production installation we're seeing this in ignores these spikes and continues sending.

The full picture is that there are some other devices that respond with LL_NACK or LL_BUSY. Our DIY devices did not send LL_BUSY in the (distant) past and do not send it as of today, so it must be some certified device which responds with NACK/BUSY although it is not even being addressed. With more than 40 devices on the bus, it's kinda hard to identify the culprit, though.

So the flow of events as we understand it is:

  1. IP interface sends LOW priority telegram that is part of point-to-point connection-oriented communication.
  2. Our DIY device receives the telegram successfully (all parity bits and checksum are ok), sends an LL_ACK and processes the telegram.
  3. At least one other device has severe trouble distinguishing between spikes and bits and, for whatever reason, responds with LL_NACK or LL_BUSY.
  4. Some HIGH priority telegram gets sent on the bus.
  5. IP interface repeats the LOW priority telegram from step 1. Our DIY device sees the same telegram again. Due to the telegram from step 4, the repetition is not filtered out by the Data Link Layer and is forwarded up to the Transport Layer, where things go downhill if Style 3 is not implemented, i.e. the connection goes down.
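Steps 2 and 3 explain why the sender repeats even though the addressed device acknowledged the telegram. A rough Python sketch of that, under stated assumptions: the byte values for LL_ACK, LL_NACK, and LL_BUSY are the ones commonly documented for TP1, the superposition of simultaneous responses is modeled as a bitwise AND (logical 0 dominating on the bus), and the function names are hypothetical. Treat all of this as an illustration rather than a reference.

```python
# Assumed TP1 acknowledgement bytes (commonly documented values).
LL_ACK = 0xCC
LL_NACK = 0x0C
LL_BUSY = 0xC0


def combined_ack(responses):
    """Model what the sender reads back when several devices answer at once.

    Assumption: a 0 bit from any device dominates, so the responses
    combine like a bitwise AND. 0xFF stands for "nobody answered".
    """
    seen = 0xFF
    for response in responses:
        seen &= response
    return seen


def sender_repeats(responses) -> bool:
    # The sender only treats the telegram as delivered on a clean LL_ACK.
    return combined_ack(responses) != LL_ACK


# Step 2 alone: the addressed DIY device acknowledges.
print(sender_repeats([LL_ACK]))
# Steps 2 and 3 together: another device adds LL_NACK or LL_BUSY.
print(sender_repeats([LL_ACK, LL_NACK]))
print(sender_repeats([LL_ACK, LL_BUSY]))
```

In this model a single stray LL_NACK or LL_BUSY from an unaddressed device is enough to mask the LL_ACK, so the IP interface repeats a telegram that the DIY device has already processed.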

> The relevant part for the TL stuff of vol. 6 "profiles" is chapter 9, mgmt client, not chapter 6. There (v2.1, sec. 9.2.3.3) it specifies TL-CO style 1.

Ouch, you're right. Heads-down analyzing the issue, I totally forgot that this is about the Management Client and mixed things up. Sorry! --> Issue closed.


2 participants