Adding retries for failed REST calls to the ISY #46
Conversation
Support retrying ISY commands if they fail
How can I move this along? I would expect that anyone sending commands to multiple devices on the ISY at once would run into the same failures that I have, so I think this would be a valuable change. |
My apologies, I meant to reply to this a few weeks ago and then it slipped my mind. So I'm back and forth on this one for a few reasons. Primarily, it is a hack for dealing with a bad response from the controller. The controller is sending the wrong error code back, and the (most) correct answer would be to get the issue fixed in the ISY firmware. If this code is used, then a legitimately bad request will block the thread for at least 5 seconds before failing out. That is also less than ideal. Then, possibly the worst of all, if the ISY becomes overwhelmed with requests, it will likely return a percentage of requests back with 404. Based on Michel's response, a percentage of those 404s will actually send a command over the Insteon network. We'd then repeat the requests that returned 404; a percentage of those may 404, a percentage of those will send a request over the network, and we'd repeat for up to 5 seconds. I would be concerned that this would flood the Insteon network for a large enough set of requests. I think this may do more harm than good. |
No problem on the delay. You're right that I'm trying to work around issues in the ISY, but it's really two-fold - first, it periodically rejects requests and maybe actually sends them, and secondly it returns the wrong error code (in my humble opinion). It would be great if we could get one or both of those resolved on the ISY side, but it seems way, way down the priority list for them, and the release schedule is glacial. If it's important to a PyISY user to fix this, I fear that the only option is through PyISY or the application that uses it. My goal with this change was to make it as simple as possible, but I take your point about blocking the main thread. Perhaps a better solution would be to spawn a worker thread to handle the retries, but we're also limited in worker threads on the ISY (I think there are three, and they also handle node servers), so I suspect that the communication will be throttled either way. As it is, failed commands usually succeed after 1 or 2 retries, so the likelihood that we'll go through the full five is low. Fixing the "wrong" error code is really just cosmetic for my use case, but I can see that it's problematic for a general-purpose solution. I don't know how to improve that without waiting for UDI to change their implementation. I was also worried about flooding the network, and I think I was making it worse when I only had a 0.5s delay between retries. After extending it to 1s, it seems to back off enough to avoid flooding the Insteon network, even if a large number of devices are being turned on. Best practice for that is really to create a scene on the ISY and just turn the scene on, and performance will be better with that, too. For me, it's all about fine-tuning the backoff delay to balance Insteon network traffic with eventually having commands succeed. As it is, commands just fail sometimes, and that's a terrible user experience and essentially unusable for automation (at least for me). I'm an embedded C dev who's just getting started with Python and I'm open to alternative implementations. I do strongly feel that some solution is required for users of the REST interface to expect predictable results. Edit: It occurred to me that we should be able to differentiate "real" 404s from these failures by checking whether the node is in the list of valid nodes before retrying. |
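A minimal sketch of the idea in that edit, assuming a set of node addresses previously fetched from the ISY's /rest/nodes list (the names here are illustrative, not PyISY's actual API):

```python
def should_retry(response, address, known_addresses):
    """Retry only when a 404 comes back for a node we know exists.

    A 404 for an unknown address is a genuinely bad request and should
    fail immediately; a 404 for a known node is treated as the ISY's
    "network busy" signal and is worth retrying.
    """
    return response.status_code == 404 and address in known_addresses
```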
I see your points. It is a bit of a lose-lose situation. Are you using this with Home Assistant? I ask because I don't currently have my home automation system set up as I'm moving this year. Ideally, I'd like this patch to be tested with hass for a few days before bumping Hass's required version. |
I originally used these changes for a few days with Home Assistant while I refined the code, but got frustrated with the difficulty of maintaining custom dependencies on the Hass.io variant. I made equivalent changes with the same algorithm in the isy994 Hass component and I've been using them for several weeks. Before submitting this PR, I put these changes back into my system and observed that the retries were happening as expected down in PyISY. Without these changes, I was strongly considering ditching the ISY and moving to native Insteon PLM support in Hass. Now that I can actually rely on the results, these changes have bought the ISY more time. |
I wanted to review this as well but haven't had any time. Ideally it should be using a requests session when talking to the ISY so it's not opening and closing the connection with each request. |
Do you have an example of that? I poked around in the ISY developer’s handbook and it wasn’t clear to me which direction to go. I don’t think it will improve the situation that this patch is targeting, though, because it seems to be related to temporary congestion on the Insteon network. |
My reference is Polyglot V2. When it tried to send many requests to the ISY, it would get flooded and we got 404s; after switching to keeping the socket open with keep-alive and reusing it, we could push many more changes faster to the ISY. Polyglot V2 uses node.js so I can't supply a direct example, but converting PyISY to use a Python requests session would allow a socket to be reused instead of every command opening and closing a new one: http://docs.python-requests.org/en/master/user/advanced/ It's possible you can still get a 404 when opening the initial session, so you will still need retries, but once it's open you can pass many more commands and you are less likely to flood the ISY. |
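A minimal sketch of what that looks like with requests (the host, credentials, and node address below are placeholders, not a working configuration):

```python
import requests

# One Session reuses the same kept-alive TCP connection for every call,
# instead of opening and closing a socket per command.
session = requests.Session()
session.auth = ("admin", "admin")  # the ISY's HTTP basic auth

# Both requests travel over the same persistent connection.
nodes = session.get("http://isy.local/rest/nodes", timeout=10)
cmd = session.get("http://isy.local/rest/nodes/<address>/cmd/DON", timeout=10)
```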
Makes sense. It sounds like a separate effort from this one, considering that we still need a backup plan for failed requests. Aren't we also competing with local traffic on the ISY, making it possible that we'll still get 404s beyond the initial connection? |
IMO, it's the same effort. When PyISY is sending many commands then it's the one creating the problem so by switching to a session you will have much more reliable (and much faster) responses to commands. |
From this thread, it sounds like the most likely explanations for 404 are that the ISY is unable to communicate with the device or the ISY determines that the Insteon network is "busy". As I read it, it doesn't sound like it's related to the REST commands themselves. |
My opinion is based on direct experience we had with this exact same issue while developing Polyglot V2 and working with Michel and e42 to resolve it. |
@rmkraus Is it really as simple as this?
I'm just messing around with a little test program, but I can still successfully get the list of nodes and turn them on and off. |
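A guess at the shape of such a change, assuming the connection wrapper previously called requests.get directly for each command (class and attribute names are illustrative, not PyISY's actual code):

```python
import requests

class Connection:
    """Illustrative wrapper: one persistent Session shared by all commands."""

    def __init__(self, address, username, password):
        self._base_url = "http://" + address + "/rest"
        self._session = requests.Session()  # was: a bare requests.get per call
        self._session.auth = (username, password)

    def request(self, path):
        # Every REST call now reuses the Session's kept-alive socket.
        return self._session.get(self._base_url + "/" + path, timeout=10)
```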
Like Don Quixote, I continue :). I made the very simple changes above that I think will establish and use a persistent session, and it's working as expected in my home system. I do see some retries kicking in if I try to turn lots of individual lights or scenes on at once, but it's maxed out at 3 even with a large amount of immediate traffic. I don't know how to prove that the persistent session is doing the right thing other than to follow the guidance and examples online. It turns out that it's really very easy to use custom dependencies in Hass.io by following the instructions here to specify my GitHub fork as a requirement, rather than the released version of PyISY. They get pulled into a directory structure under config/deps, which gets added to the search path by "bootstrap" and is persistent across restarts. |
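For reference, a hedged example of how a fork can be declared as a component requirement in the old-style Hass format (the user, branch, and version are placeholders; check the Hass docs for the exact syntax):

```python
# In the custom isy994 component: install PyISY from a fork's zip archive
# instead of the released package on PyPI (placeholders throughout).
REQUIREMENTS = [
    "https://github.com/<your-user>/PyISY/archive/<your-branch>.zip"
    "#PyISY==1.1.0"
]
```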
Is there any interest in this change? I'm using it and it makes it possible for me to reliably integrate ISY devices into my Hass setup, so I'm just trying to share the wealth. |
I am very interested, but haven't had a chance to test in my environment. |
Ok, great! No hurry. |
@rccoleman @jimboca: I'm working on a separate issue with Hass under #44 related to fixing bad Climate/Thermostat calls. In my changes, the isy994_control event has been updated to also expose the value. This is expected to be a non-breaking change in PyISY (the event is now a dict instead of a string, but will represent itself the same as the command string it used to pass). This change is vital to the climate integration, as some values are only made available from the event stream, not the node definition, but I am looking for someone else to test it to confirm it is not a breaking change. I was going to pull your changes into my |
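A sketch of how an event object could stay backward compatible in the way described above (illustrative names, not PyISY's actual classes):

```python
class NodeEvent(dict):
    """Dict-like event that still stringifies to the old command string."""

    def __init__(self, control, value):
        super().__init__(control=control, value=value)

    def __str__(self):
        # Consumers that treated the event as the bare command string
        # (e.g. "DON") keep working unchanged.
        return self["control"]

event = NodeEvent("DON", 255)
assert str(event) == "DON"    # old behavior preserved
assert event["value"] == 255  # new: the value is exposed alongside
```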
Sure, I can pull your branch into my production system. I'm not controlling any thermostats from my ISY, though. I'm really only using it as a hub for my Insteon switches and leak detectors. |
Thanks, that's OK if you don't have any thermostats, I really just want to make sure the events changes aren't going to break anything |
I've been using your branch for an hour or so and it seems to be working for me. I also see my retries kicking in when I turn a bunch of the ISY's lights/scenes on or off. I'll continue to monitor. |
So Insteon events aren't working, or at least aren't working reliably. Specifically, I have motion sensors and wall switches that send events, and they no longer trigger my Hass automations with your branch, while they work immediately when I switch back to my branch. I haven't spent time digging into what's wrong, but it's definitely not working. |
Thanks for the feedback. I'll look into it. I opened an issue at shbatm/PyISY#3 to not clutter this thread. |
As per Michel @ UDI, the ISY will return 404 to an otherwise valid REST call if it thinks that the Insteon network is overwhelmed. That command may or may not actually succeed, but this patch adds a retry mechanism with a 1 second delay until either the ISY acknowledges that the command succeeded or we run out of retries.
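A minimal sketch of that behavior, assuming a requests-based helper (the constants and names are illustrative, not this patch's exact code):

```python
import time
import requests

RETRY_DELAY = 1.0  # seconds between attempts
MAX_RETRIES = 5    # worst case ~5 seconds, per the discussion above

def get_with_retries(session, url):
    """Retry a 404 until the ISY reports success or retries are exhausted."""
    response = session.get(url, timeout=10)
    for _ in range(MAX_RETRIES):
        if response.status_code != 404:
            break
        time.sleep(RETRY_DELAY)
        response = session.get(url, timeout=10)
    return response  # a lingering 404 is left for the caller to handle
```

Combined with the known-node check sketched earlier in the thread, a genuinely bad address could skip the retry loop entirely.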