Adding retries for failed REST calls to the ISY #46

Closed
wants to merge 7 commits

Conversation

rccoleman
Contributor

As per Michel @ UDI, the ISY will return 404 to an otherwise valid REST call if it thinks that the Insteon network is overwhelmed. That command may or may not actually succeed, but this patch adds a retry mechanism with a 1-second delay between attempts, repeating until either the ISY acknowledges that the command succeeded or we run out of retries.
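
A minimal sketch of the retry loop described above (the function name and retry cap are illustrative, not the exact patch; the real change lives in Connection.py):

import time
import requests

MAX_RETRIES = 5    # illustrative cap on attempts
RETRY_DELAY = 1.0  # 1-second delay between attempts, as described above

def get_with_retries(url, auth):
    """Retry a REST call that the ISY rejected with a 404 due to congestion."""
    response = None
    for _ in range(MAX_RETRIES):
        response = requests.get(url, auth=auth, timeout=10, verify=False)
        if response.status_code != 404:
            break              # success, or a different error the caller handles
        time.sleep(RETRY_DELAY)
    return response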

@rccoleman
Contributor Author

How can I move this along? I would expect that anyone sending commands to multiple devices on the ISY at once would run into the same failures that I have, so I think this would be a valuable change.

@rmkraus
Member

rmkraus commented Mar 25, 2019

My apologies, I meant to reply to this a few weeks ago and then it slipped my mind.

So I'm back and forth on this one for a few reasons. Primarily, it is a hack for dealing with a bad response from the controller. The controller is sending the wrong error code back. The (most) correct answer to this would be to get the issue fixed with the ISY firmware.

If this code is used, then a legitimately bad request will block the thread for at least 5 seconds before failing out. That is also less than ideal.

Then, possibly the worst of all: if the ISY becomes overwhelmed with requests, it will likely return a percentage of requests back with 404. Based on Michel's response, a percentage of those 404's will actually send a command over the Insteon network. Then we'd repeat the requests that returned 404, a percentage of those may 404 again, a percentage of those will send a request over the network, and we will repeat for up to 5 seconds. I would be concerned that this would flood the Insteon network for a large enough set of requests. I think this may do more harm than good.

@rccoleman
Contributor Author

rccoleman commented Mar 25, 2019

No problem on the delay. You're right that I'm trying to work around issues in the ISY, but it's really two-fold: first, it periodically rejects requests that it may actually still send, and second, it returns the wrong error code (in my humble opinion). It would be great if we could get one or both of those resolved on the ISY side, but it seems way, way down the priority list for them, and the release schedule is glacial. If it's important to a PyISY user to fix this, I fear that the only option is through PyISY or the application that uses it.

My goal with this change was to make it as simple as possible, but I take your point about blocking the main thread. Perhaps a better solution would be to spawn a worker thread to handle the retries, but we're also limited in worker threads on the ISY (I think there are three, and they also handle node servers), so I suspect that the communication will be throttled either way. As it is, failed commands usually succeed after 1 or 2 retries, so the likelihood that we'll go through the full five is low.

Fixing the "wrong" error code is really just cosmetic for my use case, but I can see that it's problematic for a general-purpose solution. I don't know how to improve that without waiting for UDI to change their implementation.

I was also worried about flooding the network, and I think I was making it worse when I only had a 0.5s delay between retries. After extending it to 1s, it seems to back off enough to avoid flooding the Insteon network, even if a large number of devices are being turned on. Best practice for that is really to create a scene on the ISY and just turn the scene on, and performance will be better with that, too.

For me, it's all about fine tuning the backoff delay to balance Insteon network traffic with eventually having commands succeed. As it is, commands just fail sometimes, and that's a terrible user experience and essentially unusable for automation (at least for me).

I'm an embedded C dev who's just getting started with Python, and I'm open to alternative implementations. I do strongly feel that some solution is required so that users of the REST interface can expect predictable results.

Edit: It occurred to me that we should be able to differentiate "real" 404s from these failures by checking whether the node is in the list of valid nodes before retrying.
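
A rough sketch of that check (purely illustrative; how the list of valid node addresses is obtained is an assumption, not PyISY's actual API):

def should_retry(status_code, node_address, valid_addresses):
    # Only retry a 404 when the address is known to the ISY; a 404 for an
    # unknown address is a genuinely bad request and should fail immediately.
    return status_code == 404 and node_address in valid_addresses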

@rmkraus
Member

rmkraus commented Mar 25, 2019 via email

@rccoleman
Contributor Author

I originally used these changes for a few days with Home Assistant while I refined the code, but got frustrated with the difficulty of maintaining custom dependencies on the Hass.io variant. I made equivalent changes with the same algorithm in the isy994 Hass component and I’ve been using it for several weeks. Before submitting this PR, I again put these changes back into my system and observed that the retries were happening as expected down in PyISY.

Without these changes, I was strongly considering just ditching the ISY and moving to native Insteon PLM support in Hass. Now that I can actually rely on the results, these changes have bought the ISY more time.

@jimboca
Collaborator

jimboca commented Mar 25, 2019

I wanted to review this as well but haven't had any time. Ideally it should be using a requests session when talking to the ISY so it's not opening and closing the connection with each request.

@rccoleman
Contributor Author

Do you have an example of that? I poked around in the ISY developer’s handbook and it wasn’t clear to me which direction to go. I don’t think it will improve the situation that this patch is targeting, though, because it seems to be related to temporary congestion on the Insteon network.

@jimboca
Collaborator

jimboca commented Mar 27, 2019

My reference is Polyglot V2. When it tried to send many requests to the ISY, the ISY would get flooded and we got 404's; after switching to keeping the socket open with keep-alive and reusing it, we could push many more changes to the ISY, faster. Polyglot V2 uses node.js so I can't supply a direct example, but converting PyISY to use a Python requests session would allow a socket to be reused instead of opening and closing a new one for each command: http://docs.python-requests.org/en/master/user/advanced/ It's possible you can still get a 404 when opening the initial session, so you will still need retries, but once it's open you can pass many more commands and you are less likely to flood the ISY.
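
For illustration, a minimal sketch of that session reuse with the standard requests API (host, credentials, and endpoints are placeholders; PyISY's TLS adapter setup is omitted):

import requests

session = requests.Session()          # keep-alive: the TCP socket is reused
session.auth = ('admin', 'password')  # placeholder ISY credentials

# Subsequent calls reuse the same connection instead of opening a new one each time.
nodes = session.get('http://isy.local/rest/nodes', timeout=10)
status = session.get('http://isy.local/rest/status', timeout=10)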

@rccoleman
Contributor Author

Makes sense. It sounds like a separate effort from this one, considering that we still need a backup plan for failed requests. Aren't we also competing with local traffic on the ISY, making it possible that we'll still get 404s beyond the initial connection?

@jimboca
Collaborator

jimboca commented Mar 27, 2019

IMO, it's the same effort. When PyISY is sending many commands, it's the one creating the problem, so by switching to a session you will have much more reliable (and much faster) responses to commands.

@rccoleman
Contributor Author

From this thread, it sounds like the most likely explanations for 404 are that the ISY is unable to communicate with the device or the ISY determines that the Insteon network is "busy". As I read it, it doesn't sound like it's related to the REST commands themselves.

@jimboca
Collaborator

jimboca commented Mar 27, 2019

My opinion is based on direct experience we had with this exact same issue while developing Polyglot V2 and working with Michel and e42 to resolve it.

@rccoleman rccoleman closed this Mar 27, 2019
@rccoleman
Contributor Author

rccoleman commented Mar 27, 2019

@rmkraus Is it really as simple as this?

diff --git a/PyISY/Connection.py b/PyISY/Connection.py
index 6d267ce..c1b68fb 100755
--- a/PyISY/Connection.py
+++ b/PyISY/Connection.py
@@ -23,6 +23,8 @@ class Connection(object):
         self._username = username
         self._password = password
 
+        self.req_session = requests.Session()
+
         # setup proper HTTPS handling for the ISY
         if use_https and can_https(self.parent.log, tls_ver):
             self._use_https = True
@@ -31,7 +33,6 @@ class Connection(object):
             requests.packages.urllib3.disable_warnings()
 
             # ISY uses TLS1 and not SSL
-            req_session = requests.Session()
-            req_session.mount(self.compileURL(None), TLSHttpAdapter(tls_ver))
+            self.req_session.mount(self.compileURL(None), TLSHttpAdapter(tls_ver))
         else:
             self._use_https = False
@@ -69,7 +70,7 @@ class Connection(object):
             self.parent.log.info('ISY Request: ' + url)
 
         try:
-            r = requests.get(url, auth=(self._username, self._password),
+            r = self.req_session.get(url, auth=(self._username, self._password),
                     timeout=10, verify=False)
 
         except requests.ConnectionError as err:

I'm just messing around with a little test program, but I can still successfully get the list of nodes and turn them on and off.

@rccoleman rccoleman reopened this Mar 30, 2019
@rccoleman
Contributor Author

rccoleman commented Apr 2, 2019

Like Don Quixote, I continue :). I made the very simple changes above that I think will establish and use a persistent session, and it's working as expected in my home system. I do see some retries kicking in if I try to turn lots of individual lights or scenes on at once, but it maxed out at 3 retries even with a large amount of immediate traffic. I don't know how to prove that the persistent session is doing the right thing other than to follow the guidance and examples online.

It turns out that it's really very easy to use custom dependencies in Hass.io by following the instructions here to specify my GitHub fork as a requirement, rather than the released version of PyISY. They get pulled into a directory structure under config/deps, which gets added to the search path by "bootstrap" and is persistent across restarts.
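
For reference, one common way to point a requirement at a fork rather than the PyPI release is a pip-style VCS requirement like the line below (repository URL and branch name are placeholders, and this may not be the exact mechanism the Hass.io instructions describe):

# requirements entry pointing at a fork instead of the released PyISY package
git+https://github.com/rccoleman/PyISY.git@my-branch#egg=PyISY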

@rccoleman
Contributor Author

Is there any interest in this change? I'm using it and it makes it possible for me to reliably integrate ISY devices into my Hass setup, so I'm just trying to share the wealth.

@jimboca
Collaborator

jimboca commented Apr 8, 2019

I am very interested, but haven't had a chance to test in my environment.

@rccoleman
Contributor Author

Ok, great! No hurry.

@shbatm
Collaborator

shbatm commented May 5, 2019

@rccoleman @jimboca: I'm working on a separate issue with Hass under #44 related to fixing bad Climate/Thermostat calls. In my changes, the isy994_control event has been updated to also expose value. This is expected to be a non-breaking change in PyISY (PyISY's event is now a dict instead of a string, but will represent itself the same as the command string it used to pass).
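
For example, an event payload can stay compatible by rendering as the old command string (a hypothetical sketch; the class and key names are assumptions, not the actual PyISY implementation):

class ControlEvent(dict):
    """Behaves like a dict but prints as the old command string."""

    def __str__(self):
        # Consumers that formatted the event, or compared str(event) to a
        # command like 'DON', keep working unchanged.
        return self.get('control', '')

event = ControlEvent(control='DON', value=255)
print(str(event))      # DON -> same representation as before
print(event['value'])  # 255 -> the newly exposed value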

This change is vital to the climate integration, as some values are only made available from the event stream, not the node definition, but I am looking for someone else to test it to confirm it is not a breaking change.

I was going to pull your changes into my PyISY_beta branch, and am currently testing in HASS using a temporary PyISY_beta PyPi package... just wondering if either of you would be willing to help me test?

@rccoleman
Contributor Author

Sure, I can pull your branch into my production system. I'm not controlling any thermostats from my ISY, though. I'm really only using it as a hub for my Insteon switches and leak detectors.

@shbatm
Collaborator

shbatm commented May 5, 2019

Thanks, that's OK if you don't have any thermostats. I really just want to make sure the events changes aren't going to break anything.

@rccoleman
Contributor Author

I've been using your branch for an hour or so and it seems to be working for me. I also see my retries kicking in when I turn a bunch of lights/scenes on the ISY on or off. I'll continue to monitor.

@rccoleman
Contributor Author

So Insteon events aren't working or aren't working reliably. Specifically, I have motion sensors and wall switches that send events and they no longer trigger my Hass automations with your branch, while they work immediately when I switch back to my branch. I haven't spent time digging into what's wrong, but it's definitely not working.

@shbatm
Collaborator

shbatm commented May 6, 2019

Thanks for the feedback. I'll look into it. I opened an issue at shbatm/PyISY#3 to not clutter this thread.
