handling DTLS client reconnection after aiocoap server restart #224
I have no idea whether my previous explanation was understood, as there was no reaction, so I will try my best to rephrase it. What is the current error reporting mechanism of the aiocoap library for telling a client (implemented using this library) that the CoAP server was restarted, the existing DTLS session was lost, and the server will not continue to communicate with the client anymore? Such a trigger could then be used to destroy the existing client connection to the server and create a new session, including a new DTLS handshake. I am asking because we currently observe that the aiocoap client implementation in pytradfri doesn't receive any protocol exception/notification when such a situation happens. |
I just didn't get to it over the holiday backlog yet. I think I get the issue; I'll try to come up with something over the weekend.
|
What is the current error reporting mechanism of the aiocoap library for
telling a client (implemented using this library) that the CoAP server was
restarted, the existing DTLS session was lost, and the server will not
continue to communicate with the client anymore?
To the application, this is ideally not shown at all (given that any
state between the client and the server should be a matter of
optimization, and it's the CoAP library's task to hide that and show
only the abstract stateless mechanism).
aiocoap should transparently create a new DTLS session if an old one
becomes unusable; that should raise an error flying out of the request
in which it was discovered as well as any other requests currently
active on the same DTLS connection (including observations). Subsequent
requests sent after the error should establish a new connection, and
ideally all use the same DTLS connection again.
That's the theory, at least -- the issues with pytradfri show that
there's a bug in there that prevents the observations from going visibly
dead when the DTLS connection fails, I'm looking into why that is.
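A minimal sketch of how an application could lean on the behaviour described above: the request that hits the dead session raises, and a request sent afterwards is the one that would set up a new DTLS session. The names `request_with_retry` and `send_request` are hypothetical and not part of aiocoap; `send_request` stands in for whatever coroutine function issues one request, and catching `OSError` is an assumption about which exceptions surface.

```python
import asyncio


async def request_with_retry(send_request, attempts=3, delay=0.0):
    """Await the hypothetical coroutine function `send_request` until it succeeds.

    Each call to `send_request` is expected to issue a brand-new request.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return await send_request()
        except OSError as exc:
            # Assumption: a lost session shows up as an OSError subclass
            # (the built-in TimeoutError and ConnectionRefusedError both are).
            last_error = exc
            if attempt < attempts:
                await asyncio.sleep(delay)
    # Every attempt failed: hand the last error to the caller.
    raise last_error
```

Whether repeating a request is acceptable is the application's call; a wrapper like this is only reasonable for requests that are safe to send twice.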
|
I've fixed a few missing pieces in the DTLS error handling; please check whether that solves your immediate issue. Of the three error paths I've observed with the libcoap example server, two now behave well:
|
First, thank you very much for the reply, all the information, and especially for such fast fixes. I am going to test them ASAP.
Regarding your intention to handle the CoAP session mechanism in aiocoap, below the level of the implementing client: I saw several places where pytradfri handles some issues at its own level, especially when testing the results of the .request method. That seems fine to me, as in such cases it is a good idea to report the error from aiocoap to pytradfri. In most cases pytradfri handles it by dropping the aiocoap instance and creating a new one, thereby recreating the connection (network timeouts, missing credentials, etc.). I think that handling some of the issues will be required at the level above aiocoap anyway, as this depends on the intended client logic.
Just one thing that could be confusing is the term "observation", especially when reading both tickets (aiocoap and pytradfri). In the pytradfri context it refers to the HomeAssistant threaded mechanism for monitoring IoT device states asynchronously. In aiocoap it is, IMHO, a way to implement CoAP sessions transparently in the state machine. So, to apply that to the current situation: the HA/pytradfri observation fails to work because the aiocoap observation fails to catch the erroneous state triggered by the CoAP server (Tradfri GW) restart.
I will validate how aiocoap now behaves against the IKEA Tradfri GW server implementation and let you know ASAP. But as timing could be a crucial condition, I do not know how accurate my tests will be. As described above, the client implementation at the level above aiocoap (pytradfri) needs to handle some edge cases anyway, as not everything can be decided/handled in the session state machine in aiocoap. So I think that at least the possibility of raising a network timeout exception from aiocoap to pytradfri in such a case (server restart) will help, maybe even straight away, as pytradfri will probably just drop the existing aiocoap instance and create a new one. I would prefer such an implementation over recreating the transport in aiocoap (as you mentioned), as I agree this is not a good place to be. Signing out for testing. ;-) |
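The "drop the existing instance and create a new one" approach mentioned above could look roughly like the sketch below. `ReconnectingClient` and `make_client` are hypothetical names, not pytradfri or aiocoap API; the client object is assumed to offer async `request()` and `shutdown()` methods, and a dead session is assumed to surface as an `OSError` subclass.

```python
import asyncio


class ReconnectingClient:
    """Holds one client object and replaces it after a failed request.

    `make_client` is a hypothetical async factory supplied by the application.
    """

    def __init__(self, make_client):
        self._make_client = make_client
        self._client = None

    async def request(self, *args, **kwargs):
        # Create the client lazily, on first use or after a failure.
        if self._client is None:
            self._client = await self._make_client()
        try:
            return await self._client.request(*args, **kwargs)
        except OSError:
            # Assumption: the lost session is reported as an OSError subclass.
            # Forget the broken client so the next call builds a fresh one,
            # then let the caller see the original error.
            broken, self._client = self._client, None
            await broken.shutdown()
            raise
```

The error still reaches the caller, so the layer above can decide whether to repeat the request or restart anything that depended on the old connection.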
On the topic of reconnecting and where to handle errors: A network error does need to be handled by the application, simply because aiocoap can't (generally) be certain that "just trying again" is legal. Shutting all of aiocoap down and trying anew is certainly a way to deal with it, although I hope that (at least with the current improvements) just sending a new request should be fine in many cases.
Ad observation: Good point -- when I'm talking of observation here, it's always about the CoAP Observe mechanism. That mechanism does have a kind of inherent way to indicate when it breaks (based on Max-Age) -- does the Tradfri GW use that consistently? (Ideally, it should send a Max-Age option that indicates when it will next send a state, or send something every 60 seconds, which is the default for Max-Age, but I don't know if it does that.)
On validating and timing: Timing should not be too crucial. If HA/pytradfri only uses observe, the reboot may take quite some time and it'd be OK (bringing us back to the Max-Age above). If they additionally do any polling, then that polling sets how much time there is for the restart. |
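The Max-Age based check described above can be sketched as a small freshness tracker. `ObservationFreshness` is a hypothetical helper, not an aiocoap class; it assumes the application is told the Max-Age value (if any) of each notification and falls back to the 60 second default otherwise.

```python
import time

DEFAULT_MAX_AGE = 60  # seconds; used when a notification carries no Max-Age


class ObservationFreshness:
    """Tracks whether the last notification is still within its Max-Age."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._deadline = None  # no notification seen yet

    def notification_received(self, max_age=None):
        # Each notification pushes the deadline out by its own Max-Age.
        age = DEFAULT_MAX_AGE if max_age is None else max_age
        self._deadline = self._clock() + age

    def is_stale(self):
        # Stale means: a notification was seen, and its Max-Age has run out.
        return self._deadline is not None and self._clock() > self._deadline
```

An application could poll `is_stale()` and treat a stale observation as a reason to register it again.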
I fully agree that network, credentials, etc. errors should be handled by the application. I did some first tests, but it seems that the retransmission mechanism doesn't work so far; I will post the results below.
I agree that pytradfri should handle it as well; the only issue was that so far no event was passed out of aiocoap, and I don't think it is a good idea to implement timers in pytradfri to handle a "no response" state.
Thank you, I think we now have a good understanding of the aiocoap/pytradfri/HA interactions. I have no idea whether the Tradfri GW uses Max-Age, and even worse, I have no idea how to check (as the traffic is DTLS, a packet dump will not help). I will repeat my test with at least a 60 s window to see whether something triggers (so far all observed errors were triggered only by a new request from the client).
The Tradfri GW restart actually seems to be swift: I am able to use IoT devices from HA again in less than 60 s, so I assume the Tradfri GW can reboot in under 60 s. |
This is the result from the HA logs after:
To me it seems that something is now being triggered, but the exceptions are not fully handled in aiocoap...? It also looks like there is a race condition between the aiocoap retransmission mechanism and pytradfri's handling of aiocoap exceptions.
|
Just to make sure as I see a few "connection refused" and "no credentials" errors:
|
When you receive the responses, you can look into |
My previous tests used the version from Dec19 2020 (e181700). With this version aiocoap and pytradfri were able to recover to a state where new IoT commands were processed after the GW restart, but no HA observations worked afterwards. The current tests were done with code downloaded as a zip from the GitHub download menu approx. 1 hr earlier (as far as I understand GitHub, this is master/HEAD), but I cannot check the commit, as the code is downloaded without git metadata. I am going to reproduce the tests with a git clone of the repo now.
I will do my best, but as I stated at the beginning, my Python skills are ~0 (Perl person here), and the pytradfri code lacks any support from previous developers. :-o |
OK, the results seem to be the same as in the previous test (so it seems I was using the latest code), so I also performed a slightly different test:
So it seems to work (it worked similarly with the version from Dec17, but without any exceptions), but:
|
OK, I think I'll have to try to reproduce this locally. I have a Tradfri GW and lamp around but haven't used them much yet; can this whole trouble also be observed using pytradfri and no HA? (It'd just help make the whole setup a lot more manageable.) |
I understand that it would help significantly, but unfortunately my skills are not up to the task, I am very sorry. :-o |
I'll give it a try anyway (but could take some time), thanks for bringing all this to my attention. |
I am using master of aiocoap (1af4c3f) and face the reconnection issues with the IKEA Tradfri gateway as well. To connect to the gateway I am using the script in #236:
Once the script is restarted, then everything works fine again. |
Also, there might be a more generic problem here.
|
Checking with a few Tradfri devices, the Max-Age sent by the gateway is 604800 seconds (168 hours, or one week). According to https://datatracker.ietf.org/doc/html/rfc7641#section-3.3.1
By default, aiocoap could wait the recommended time and re-register the observation with the server. However, this could be made configurable, so that an end user of an application could decide the timeout value for an observation. |
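With a Max-Age of 604800 s (604800 / 3600 = 168 hours), a check based on Max-Age alone would only notice a lost observation after up to a week, which is why a configurable timeout is suggested above. A rough sketch of that idea follows; `next_or_reregister`, `notifications`, and `reregister` are hypothetical names, with `notifications` assumed to be an `asyncio.Queue` the application fills with incoming notifications and `reregister` an async callable that registers the observation again.

```python
import asyncio


async def next_or_reregister(notifications, reregister, timeout):
    """Return the next notification, re-registering after each silent `timeout`.

    `timeout` is the application-chosen limit in seconds, independent of the
    Max-Age sent by the server.
    """
    while True:
        try:
            return await asyncio.wait_for(notifications.get(), timeout)
        except asyncio.TimeoutError:
            # Nothing arrived in time: assume the observation may be dead
            # and ask the application to register it again.
            await reregister()
```

Re-registering is done with a plain repeated request, so a too-small timeout costs extra traffic rather than correctness.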
Is there a workaround for this? I am seeing the same issue. |
Hello, as you are probably aware, aiocoap is used in the pytradfri API, which is used in the Homeassistant integration for IKEA Tradfri devices. Due to the current abandoned-ish state of both the tradfri and pytradfri code, I am trying to fix some long-standing issues with its functionality. My knowledge is very limited in all aspects required for achieving that goal, so I am sorry for bothering you with something which may not be an aiocoap issue at all.
This is the related issue in the HA repo; since several issues are being tracked there, please start reading at the given comment.
We observed that after the IKEA Tradfri GW (the CoAP server) is restarted, Homeassistant is unable to restore communication with it, causing both device commands and state monitoring to fail. This was observed with HA using aiocoap version 0.4b3. It seems that, as no exception is raised from aiocoap to pytradfri, pytradfri is not able to create a new connection to the rebooted Tradfri GW and re-establish communication. (pytradfri uses DTLS via the aiocoap tinydtls transport.)
We also tried the latest aiocoap version, which made the situation a little better: at least new commands are executed successfully after the GW restart, but async monitoring of device states is not working. To my limited knowledge, this is because the current implementation requires an exception to be raised from aiocoap -> pytradfri -> HA to restart the device observation thread. But there is no such exception, so a new connection to the rebooted GW is never established. It seems that, after the restart, the GW is replying to requests on the old connection with:
TimeoutError: [Errno 110] Operation timed out
My question is: is there a way for pytradfri to be notified by aiocoap (by raising an exception?) that the current connection is no longer working, so it can destroy it and create a new one?
Thank you for any help.