can't retrieve link database for devices that aren't the PLM #2
Comments
The node library is sending a slightly different message.
home-controller:insteon Send: {"command":{"cmd1":"2F","cmd2":"00","extended":true,"userData":["00","00","0FFF","01"],"type":"62","checksum":194,"raw":"026233a9411F2F0000000FFF010000000000000000C2","id":"33a941"},"deferred":{"promise":{}},"timeout":5000} +1ms
2018/12/30 16:49:53 DEBUG Retrieving Device link database
This patch makes the sent packets match:
But it still hangs:
|
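When comparing raw packets between the two libraries, it can help to know how the trailing checksum byte is derived. The sketch below assumes the usual Insteon rule for extended messages (the two's complement of the 8-bit sum of cmd1, cmd2 and user data bytes D1 through D13, stored in D14). The function name `extendedChecksum` is made up for illustration and is not taken from either library.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// extendedChecksum is a hypothetical helper. It returns the two's complement
// of the 8-bit sum of cmd1, cmd2 and the given user data bytes (D1..D13).
func extendedChecksum(cmd1, cmd2 byte, userData []byte) byte {
	sum := cmd1 + cmd2 // byte arithmetic wraps at 256, which is what we want
	for _, b := range userData {
		sum += b
	}
	return ^sum + 1
}

func main() {
	// Raw packet copied from the home-controller log line above.
	raw, err := hex.DecodeString("026233a9411F2F0000000FFF010000000000000000C2")
	if err != nil {
		panic(err)
	}
	// Assumed layout: 02 62 | address (3) | flags | cmd1 cmd2 | D1..D14
	cmd1, cmd2 := raw[6], raw[7]
	userData := raw[8:21] // D1..D13; raw[21] is D14, the checksum itself
	fmt.Printf("computed %02x, packet carries %02x\n",
		extendedChecksum(cmd1, cmd2, userData), raw[21])
}
```

The computed value should agree with the `"checksum":194` (0xC2) field in the log. An all-zero Read/Write ALDB request with cmd1 0x2F works out to 0xD1 under the same rule.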
Is that the complete trace? It seems like there are some PLM acks missing to me. When I do the same command to one of my devices it looks like this:
abates@smilinjack:~/local/src/github.com/abates/insteon/cmd/ic$ ./ic -log trace device info 4c.0e.d4
2018/12/30 21:07:34 insteon.(*Network).sendMessage:134 TRACE Sending &{00.00.00 4c.0e.d4 SD 2:2 Engine Version []} to network
2018/12/30 21:07:34 DEBUG Sending packet to port
2018/12/30 21:07:34 plm.(*Port).send:94 TRACE TX 02 62 4c 0e d4 0a 0d 00
2018/12/30 21:07:34 plm.(*PLM).receive:158 TRACE RX Send INSTEON Msg 00 00 00 4c 0e d4 0a 0d 00 ACK
2018/12/30 21:07:34 plm.(*PLM).receive:158 TRACE RX Std Msg Received 4c 0e d4 3d 96 e1 26 0d 02
2018/12/30 21:07:34 insteon.(*Network).receive:100 TRACE Received Insteon Message &{4c.0e.d4 3d.96.e1 SD Ack 2:1 Command(0x02, 0x0d, 0x02) []}
2018/12/30 21:07:34 insteon.(*Network).sendMessage:134 TRACE Sending &{00.00.00 4c.0e.d4 SD 2:2 ID Request []} to network
2018/12/30 21:07:34 DEBUG Sending packet to port
2018/12/30 21:07:35 plm.(*Port).send:94 TRACE TX 02 62 4c 0e d4 0a 10 00
2018/12/30 21:07:35 plm.(*PLM).receive:158 TRACE RX Send INSTEON Msg 00 00 00 4c 0e d4 0a 10 00 ACK
2018/12/30 21:07:35 plm.(*PLM).receive:158 TRACE RX Std Msg Received 4c 0e d4 3d 96 e1 26 10 00
2018/12/30 21:07:35 insteon.(*Network).receive:100 TRACE Received Insteon Message &{4c.0e.d4 3d.96.e1 SD Ack 2:1 Command(0x02, 0x10, 0x00) []}
2018/12/30 21:07:35 plm.(*PLM).receive:158 TRACE RX Std Msg Received 4c 0e d4 02 2a 45 83 01 72
2018/12/30 21:07:35 insteon.(*Network).receive:100 TRACE Received Insteon Message &{4c.0e.d4 02.2a.45 SB 3:0 Set-button Pressed (responder)(114) []}
Device: Switch (4c.0e.d4)
Category: 02.2a
Firmware: 0x45
2018/12/30 21:07:35 DEBUG Retrieving Device link database
2018/12/30 21:07:35 insteon.(*Network).sendMessage:134 TRACE Sending &{00.00.00 4c.0e.d4 ED 2:2 Read/Write ALDB [0 0 0 0 0 0 0 0 0 0 0 0 0 0]} to network
2018/12/30 21:07:35 DEBUG Sending packet to port
2018/12/30 21:07:35 plm.(*Port).send:94 TRACE TX 02 62 4c 0e d4 1a 2f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d1
2018/12/30 21:07:35 plm.(*PLM).receive:158 TRACE RX Send INSTEON Msg 00 00 00 4c 0e d4 1a 2f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 d1 ACK
2018/12/30 21:07:35 plm.(*PLM).receive:158 TRACE RX Std Msg Received 4c 0e d4 3d 96 e1 26 2f 00
... |
Yes, there's no ACK from the Read ALDB command.
Here's another with similar hardware to what you're using (a Switch)
|
So the way the PLM and Insteon works is this:
Once the device ACK's the message it may respond further, as in the case with the Read/Write ALDB. In your case we should be seeing:
It seems like we're only seeing the message being sent to the PLM but the PLM never acknowledges that it received the request. This is really strange to me and I wonder if there is something wrong with your PLM? Maybe the javascript package isn't waiting for PLM ACKs? In any case, I thought I had timeouts set throughout the code so that if any Ack/Nak along the way didn't come then a timeout would occur and a specific error would bubble back up to the caller. I'll look into this specifically. |
I added a timeout as well as an additional trace log. Please pull the latest commit and re-run the trace. |
|
So the Read/Write ALDB send request is sent to the PLM and then the PLM never responds again. Serial I/O runs in a go routine and everything is traced, so it's not like something is hanging and preventing a read. Can I see a trace from the javascript package all the way through completion of querying the link database? Also, how long does that take? Maybe for some reason your PLM is taking a really long time to respond and the go package is just timing out too soon. |
Here's a trace of the JS version: It takes a minute and 5 seconds to retrieve the entire linkDB. (Setting a 120s timeout on the go version doesn't help, it's getting into some sort of stuck state.) I can factory reset the PLM, but it's not clear if I need to remove all the links to the modem from the devices first or if that's going to cause me trouble later. (And doing that without using the modem... will be problematic.) |
This is the JS library: https://github.com/automategreen/home-controller/ |
I wouldn't factory reset the modem just yet. The one place that a read blocks indefinitely in the go package is here. I've added some logging statements to see if that is indeed where we stop getting traffic from the PLM. Please try again and update the trace. |
It never gets to the new log lines:
It looks to me like it's blocked here on reading the first byte of the PLM packet. See goroutine 19. |
Adding a timeout to the serial port doesn't help... there just doesn't seem to be a packet to read.
|
This seems like a silly question, but you don't have another process reading from /dev/ttyUSB0, do you? |
Reading through the trace from the javascript package (and trying to analyze their source) it seems like there are a ton of send timeouts. This would explain why it is so slow. In the js package, a send timeout results in a retry several times. This also explains why the go package gives up, because it never retries if there was no ack seen. I think I can try to reproduce the behavior of the js package. Give me a few minutes. |
Not a silly question, and no. I don't have anything else reading from /dev/ttyUSB0. But, progress! Increasing writeDelay to 650ms makes things work. 600ms is too short. I'll send a PR later with that change and a few other little things. Related to debugging, I'm wondering if there's one or two extra levels of channels and goroutines in the library, that might make more sense being in the application level. But that's for a different day if it works. :) |
Okay, branch issue-2 has the updated code that retries on read timeouts. Strange that increasing writeDelay makes things work. That could (will?) make communication really really slow. Out of curiosity, how many Insteon devices do you have on your network? |
Re Debugging: I'm not sure I understand your question, but I'll be the first to admit that the plm portion of code could probably use some work. I feel like everything in insteon is in pretty good shape, but insteon/plm is not well tested or designed. |
9 devices including the PLM. the issue-2 branch hangs:
The stack trace is less obvious about what's blocking:
Have to step away from the computer for a little... back later. |
Okay, I merged the PR. @rspier: do you consider this issue resolved? I still have some lingering unease. Even if the PLM stopped sending traffic, there should have been 3 timeouts logged and then an error returned instead of the program hanging indefinitely. |
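The behavior described above (a bounded number of logged timeouts, then an error, never an indefinite hang) could be sketched roughly as follows. This is not the library's code: `awaitWithRetries`, the channel-based receive path, and the local `ErrAckTimeout` value are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrAckTimeout is a local stand-in for whatever error the library returns.
var ErrAckTimeout = errors.New("ack timeout")

// awaitWithRetries (hypothetical) calls resend, then waits up to timeout for
// a packet on rx. After the given number of attempts it returns ErrAckTimeout
// instead of blocking forever.
func awaitWithRetries(rx <-chan []byte, resend func(), timeout time.Duration, attempts int) ([]byte, error) {
	for i := 1; i <= attempts; i++ {
		resend()
		select {
		case pkt, ok := <-rx:
			if !ok {
				return nil, errors.New("port closed")
			}
			return pkt, nil
		case <-time.After(timeout):
			fmt.Printf("attempt %d/%d timed out\n", i, attempts)
		}
	}
	return nil, ErrAckTimeout
}

func main() {
	silent := make(chan []byte) // simulates a PLM that never answers
	sent := 0
	_, err := awaitWithRetries(silent, func() { sent++ }, 20*time.Millisecond, 3)
	fmt.Println("sends:", sent, "error:", err)
}
```

Because the wait is a `select` with `time.After`, a silent PLM costs at most `attempts * timeout` before the caller gets an error back.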
I think we should leave it open. PR #3 is more of a workaround than a fix. I've been playing with some stuff this afternoon and timeouts are still happening in an inconsistent manner. I added some more logging (and delays) to the write path. My suspicions are leading towards a race somewhere, but I haven't nailed it down yet. (Was focusing on implementing something with the library instead.) |
I agree that it seems like some kind of race condition. I'll continue to try and determine what's going on here and let you know if I come up with anything. |
I'm re-reading the PLM developer guide and something has occurred to me. The modem does not implement hardware flow control, instead implementing software flow control by echoing the bytes received from the host. The way I wrote the current code, the echoed bytes are discarded/ignored. If we rewrite this to wait for the echoed bytes and use that as flow control it may actually fix this problem. I'll look into using the echoed bytes as flow control. |
Nope, that's not right, the echoed traffic is actually accounted for in the plm code... |
The PLM developer documentation makes this statement:
Right now, the PLM code will send a command and queue all subsequent commands until an ACK (matching packet ending in 0x06) is received. The above statement seems to indicate that earlier PLMs may rely on echoing (as opposed to acknowledging) for flow control. |
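As a rough illustration of "matching packet ending in 0x06": the modem is expected to echo the host's command bytes followed by one status byte, 0x06 for ACK or 0x15 for NAK. The sketch below checks a received frame against the frame that was sent. `classifyEcho` is a hypothetical name, and the raw bytes are an assumed wire form of the `TX 02 62 4c 0e d4 0a 0d 00` line from the earlier trace, which only shows the reply in decoded form.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	plmAck = 0x06 // modem accepted the command
	plmNak = 0x15 // modem rejected the command (busy or malformed)
)

// classifyEcho (hypothetical) reports whether rx is the modem's echo of tx
// and, if so, whether the trailing status byte was an ACK.
func classifyEcho(tx, rx []byte) (matched, acked bool) {
	if len(rx) != len(tx)+1 || !bytes.Equal(rx[:len(tx)], tx) {
		return false, false
	}
	switch rx[len(tx)] {
	case plmAck:
		return true, true
	case plmNak:
		return true, false
	}
	return false, false
}

func main() {
	tx := []byte{0x02, 0x62, 0x4c, 0x0e, 0xd4, 0x0a, 0x0d, 0x00}
	rx := append(append([]byte{}, tx...), plmAck) // assumed raw form of the echo
	matched, acked := classifyEcho(tx, rx)
	fmt.Println("matched:", matched, "acked:", acked)
}
```

Under this scheme, queued commands would be released only once `matched` is true, and a matched NAK would be a signal to resend rather than to move on.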
The plm/port code had a couple areas where channel closing or error conditions weren't handled quite correctly. I don't think it will affect this issue, but you might try and see what happens now. |
I think the situation might be improved (lower writeDelay), but not completely. I have to use a writeDelay of 538ms +/- 1ms to consistently read the linkdb from dimmers. (./ic dimmer info) My PLM isn't that old, I bought it in April 2015. It's Firmware: 158, if that helps with PLM age determination. From the documentation I can find, that seems to be long after the echo flow control change. The writeDelay shouldn't be needed at all, looking at "IM RS232 Port Settings" in the manual. |
My PLM is firmware 158 as well which makes this issue even more bizarre. Regarding the writeDelay. I agree that it seems like it shouldn't be needed. However, I've found that without it, I get a ton of NAK's from the PLM when performing many subsequent operations. I've been thinking lately that the better way to do this is to always try to send the command to the PLM immediately, then, if the PLM responds with a NAK set a writeDelay and try again (up to N times). One thing I've learned is that if there is active traffic on the Insteon network (not necessarily originating from the PLM), then the PLM responds with a NAK. I have a hypothesis that back to back Insteon messages (with no delay) will result in a PLM NAK for the second message. My understanding so far is that this is due to messages being repeated by every device on the network until the hopsLeft count is zero. In the case of a moderate or large Insteon installation (I have about 70 devices on my network) this adds a noticeable delay. |
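The "send immediately, and only delay after a NAK" idea could look something like this. It is a sketch of the proposal, not code from the repository: `sendWithBackoff`, `errNak`, and the fake sender in `main` are all invented for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNak is a local stand-in for "the PLM answered with a NAK".
var errNak = errors.New("plm nak")

// sendWithBackoff (hypothetical) tries send right away. Only when the PLM
// NAKs does it sleep for writeDelay and try again, up to maxTries in total.
// It returns the number of tries used.
func sendWithBackoff(send func() error, writeDelay time.Duration, maxTries int) (int, error) {
	if maxTries < 1 {
		maxTries = 1
	}
	var err error
	for try := 1; try <= maxTries; try++ {
		if err = send(); err == nil {
			return try, nil
		}
		if !errors.Is(err, errNak) {
			return try, err // not a NAK: retrying will not help
		}
		if try < maxTries {
			time.Sleep(writeDelay)
		}
	}
	return maxTries, fmt.Errorf("gave up after %d tries: %w", maxTries, err)
}

func main() {
	calls := 0
	// Fake PLM for demonstration: NAK the first two writes, then accept.
	fakeSend := func() error {
		calls++
		if calls < 3 {
			return errNak
		}
		return nil
	}
	tries, err := sendWithBackoff(fakeSend, 10*time.Millisecond, 5)
	fmt.Println("tries:", tries, "error:", err)
}
```

The appeal of this shape is that a quiet network pays no delay at all, while a busy one falls back to the current writeDelay behavior.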
@rspier I updated a bunch of the code to remove go routines and potential race conditions... if you get a chance, can you see if the changes fixed your timeout issue? |
Sorry for the slow response... I don't think it does. I no longer need to raise the timeout, but I still need to raise the writeDelay. The actual value fluctuates. 537ms was working for a while, but 3 minutes later I need something higher.
At bef47b6...
Low writeDelay fails differently
$ ./ic --writeDelay 536ms device info 2c.75.ca
High writeDelay works ok
$ ./ic --writeDelay 600ms device info 2c.75.ca
--log trace slows things down enough to make it happy without writeDelay. |
I've added a ttl flag which you might want to play with to see if it changes anything (although I'm doubtful it will have any effect). I also changed the ErrReadTimeout in one place to be an ErrAckTimeout to try and determine where things are happening. Please try again, if you wouldn't mind. Also, if you could post a new trace, that would be helpful. I don't think the trace will really change, but with all the changes I made recently I just want to be sure. Thanks |
Good news and bad news. Good news: $ go run github.com/abates/insteon/cmd/ic device 2c.75.ca info Bad news: Good news: Bad news: 3 Runs, 3 outputs:
Interestingly, if it's going to fail, it fails early (4.8 seconds). Getting the entire link database is slow. (22 or 25 seconds, 19 with trace enabled.) Here's an
And a
The TTL flag is already set to its maximum of 3. Lowering it to 1 got me one Ack timeout (likely unrelated) and 2 successes. Lowering TTL to 0 returned 3 Ack timeouts followed by an unexpected success. Setting TTL to 4 (which should be the same as 0) performed similarly. I think that's because it takes 20 seconds for the data to get back to the PLM, and the next time I ask for it, it's there. Thanks! -R |
See pull request #10 for a fix to actually set the TTL. It (surprisingly?) doesn't change any of the behavior seen previously. |
You know, I wonder if we stumbled onto something here. Due to the bug where TTL wasn't getting set, the PLM was actually sending out a TTL of zero, which means no other device on the network would propagate it. I don't think the PLM will change the TTL, either. Presumably, the receiving device gets the message that the PLM sent and responds, even though the TTL is zero. After you made the changes for PR #10, if you set the TTL to 2 or 3 does everything still work as described in your most recent notes? |
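For reference while reading the traces, the TTL lives in the low nibble of the Insteon flags byte: bits 3-2 carry hops left and bits 1-0 carry max hops. The sketch below decodes that nibble and shows why a TTL of 4 collapses to 0. The helper names `hops` and `withTTL` are made up, and writing the same value to both fields is an assumption about what the `ttl` flag does.

```go
package main

import "fmt"

// hops (hypothetical) extracts hops-left (bits 3-2) and max-hops (bits 1-0)
// from an Insteon message flags byte.
func hops(flags byte) (left, maxHops int) {
	return int(flags>>2) & 0x3, int(flags) & 0x3
}

// withTTL (hypothetical) keeps the upper nibble of flags and writes ttl into
// both hop fields. Each field is only two bits wide, so ttl is masked to 0..3.
func withTTL(flags byte, ttl int) byte {
	t := byte(ttl) & 0x3
	return flags&0xf0 | t<<2 | t
}

func main() {
	// Flags bytes taken from the packets quoted earlier in this thread.
	for _, f := range []byte{0x0a, 0x1a, 0x26, 0x83, 0x1f} {
		left, maxHops := hops(f)
		fmt.Printf("flags %02x: max hops %d, hops left %d\n", f, maxHops, left)
	}
	fmt.Printf("ttl 3 -> %02x, ttl 4 -> %02x, ttl 0 -> %02x\n",
		withTTL(0x10, 3), withTTL(0x10, 4), withTTL(0x10, 0))
}
```

The decoded values line up with the `2:2`, `2:1` and `3:0` pairs in the ic trace, read as max hops followed by hops left.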
As for the read timeout above, it looks suspiciously like a race condition. There are two ID request messages sent. The response for ID request is that the device will ACK the request and then send a broadcast "set button pressed controller/responder" message. If you look at the read timeout trace, we have two ID requests, two ID request ACKs and two broadcast messages. However, one of the broadcast messages appears to come before the ID request message. This really looks like a race condition somewhere... |
Oh, and I'm now able to seemingly reproduce the problem, so yay for troubleshooting on my end. |
there is also an intermittent failing test:
It sometimes fails, but not often. |
Connection really should be synchronous, this commit removes the buffering from channels in an effort to fix issue #2
After re-reading the Insteon Developer Guide "Timeslot Synchronization" section I updated the PLM transmit code to wait for 2 * (6 or 13 depending on standard/extended messages) * ttl zero crossings (basically that value / 60). It looks like the appropriate amount of time to wait for standard messages is about 600ms and for extended length messages is 1.3s. This seems to correspond to what you are seeing for behavior on your end. If you want to test the dynamic writeDelay just set -writeDelay 0 and see how it works. relevant code here |
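The arithmetic in the comment above can be written out as a small function. This is only a restatement of the formula as described (2 × slot count × ttl, divided by 60 to get seconds), not the code behind the "relevant code" link. `propagationDelay` is a hypothetical name.

```go
package main

import (
	"fmt"
	"time"
)

// propagationDelay (hypothetical) follows the formula quoted above:
// 2 * (6 for standard, 13 for extended) * ttl zero crossings, with the
// result divided by 60 to convert to seconds.
func propagationDelay(extended bool, ttl int) time.Duration {
	crossings := 6
	if extended {
		crossings = 13
	}
	return time.Duration(2*crossings*ttl) * time.Second / 60
}

func main() {
	fmt.Println("standard, ttl 3:", propagationDelay(false, 3))
	fmt.Println("extended, ttl 3:", propagationDelay(true, 3))
}
```

With a ttl of 3 this gives 600ms for standard and 1.3s for extended messages, which matches the figures in the comment and is in the same range as the hand-tuned writeDelay values reported earlier.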
ref 8102653
$ time go run -v ./cmd/ic device 33.14.bd info
After power cycling the PLM... 7 successful runs. Fastest 19.2, slowest 23.8 seconds, average around 22 seconds. This is looking really good! Don't try and run two at once :) But that's a different problem (though it may point to needing an external lock somewhere). -R |
I wonder why power cycling the PLM changed the outcome? |
I meant for that to normalize the test, but I think I inadvertently confused things. It does appear something was up though. Without power cycling...
$ time go run -v ./cmd/ic device 33.14.bd info
No writeDelay=0, and it's still succeeding. 🙁 -R |
So... where do we stand with this issue? |
I think we can close it. I've been unable to trigger a failure. We can always open it again if it comes back. |
(forked from #1 (comment))
This is using a modified version of message.go that sets TTL and Max TTL to 3.
It seems to work OK, if slowly, using a Node.js library (home-controller).
$ time ./insteon-link /dev/ttyUSB0 33a941
Connecting to /dev/ttyUSB0
Polling device: 33a941
Device found: 33a941 - Dimmable Lighting Control
Getting links for device: 33a941
Found 6 link(s):
real 0m42.134s
user 0m0.400s
sys 0m0.016s
I'll try and compare the requests between the two libraries.