[Bug]: plc4j-tools-connection-cache: broken connections remaining in the cache on timeout #900
Comments
Can you try my repository? If it works, I can push the fix for the plcconnection bug.
@chrisdutz can you look into this?
Sorry for the late response. I was able to reproduce the problem.
Interestingly it did recover after quite some time ... will look into this. Run 0: Value: 1,000000
So I guess the problem is that we have a connection timeout that is longer than your 1 second timeout ... so it looks as if the connection fails and the next connect runs into the void ... we then wait for this to time out and then start a new attempt ... if the connection is available again, it connects and the application recovers; if it's not available, we wait for the next connection timeout and then try again.
Ok ... keeping the connection separated for longer than the connection timeout resulted in a different set of errors. Unfortunately I didn't quite understand your solution or couldn't find it in the branch.
@spnettec I tried your branch a long time ago and it seemed to work. I only used your changes to the connection-cache in that testing. @chrisdutz When I created this issue I didn't get a single re-connection. I have pulled from develop and I now also get re-connections. Will try the nifi integration that was not working. Will let you know if it works properly!
If you are looking for my changes you can view them here.
@spnettec I think I didn't accept your PR because it addressed multiple things and changed the API of PLC4X in an undesirable way. That was why I tried implementing an alternate path. @QuanticPony thanks for the pointer ... that link helps a lot. Will have a look at it after work.
Ok ... so if I have a look at the changes it seems your branch is quite diverged from develop ... I'd like to help work on fixing any issues you might be having here.
Hi again. I had a look into the nifi integration after updating the branch: Not sure how we should handle the reconnection from the nifi integration side. Right now, using it to connect to an opcua plc means that after a disconnection a user must manually restart the processors or restart nifi. Tried again with the changes I made. It reconnects in both cases.
I think my main problem is: how can I simulate the situation in a unit test to prove the cache works correctly?
@chrisdutz I have made a unit test for the behavior that I think is correct. You can check it here. The problem that I see is that any exception while (for example) reading from the plc triggers the LeasedPlcConnection to be invalidated. But a timeout due to a broken connection does not.
Thanks for that ... of course I first have to finish my changes on the subscription API, or nothing will compile ;-)
Ok ... so the first test passes in my setup ... however for the second one I'm not really sure how it should fail ... the timeout is on the request execution ... so the client says "give me that in 50 ms", but this timeout exists only in the CompletableFuture and the driver has no way to know how long the client is willing to wait. I guess I should think of a way for the operation to time out internally (without waiting 10 seconds or so). Or how does the driver know about your CompletableFuture timeout?
The connection cache gives out a leased connection with the real connection inside. If an exception occurs in the real connection, the leased connection is invalidated. This is the first test. I think it is a problem that when the client gets a timeout on the leased connection, the CompletableFuture of the real connection, the one that invalidates the leased connection, does not. This is the case that breaks the cache for me, as that connection is no longer usable. I think a timeout in the leased connection should propagate into the real connection, making the real connection invalid. Otherwise the client has no way of removing the real connection from the connection cache. I see 2 solutions: either propagate the timeout or allow the client to manually invalidate a connection from the cache.
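To make the failure mode described above concrete, here is a minimal sketch that uses only the JDK, no PLC4X classes. The never-completing future stands in for a request on a broken connection, and the `invalidated` flag is a hypothetical stand-in for whatever the cache does to mark a connection as unusable. It only illustrates that a client-side `get(timeout)` does not complete the underlying future, so a callback attached to that future never runs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: none of these names come from PLC4X.
class TimeoutPropagationSketch {

    /** Returns true if the invalidation callback ran after a client-side timeout. */
    static boolean invalidatedAfterClientTimeout() {
        AtomicBoolean invalidated = new AtomicBoolean(false);

        // Stand-in for a request on a broken connection: it never completes.
        CompletableFuture<String> realRequest = new CompletableFuture<>();

        // Stand-in for the leased connection: it reacts only when the
        // real future completes exceptionally.
        realRequest.whenComplete((value, error) -> {
            if (error != null) {
                invalidated.set(true);
            }
        });

        try {
            // The client is only willing to wait 50 ms.
            realRequest.get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Only the caller sees this; realRequest itself is still pending.
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return invalidated.get();
    }

    public static void main(String[] args) {
        System.out.println("invalidated = " + invalidatedAfterClientTimeout());
    }
}
```

Running `main` prints `invalidated = false`: the timeout is raised in the caller's thread, while the pending future, and therefore the callback, is left untouched.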
Well, I did locally change the LeasedConnection to simply react to Exception instead of PlcRuntimeException ... this should now catch timeout errors too ... but I'm wondering if we shouldn't be catching the timeouts and converting them to Plc-exceptions.
Think I found it ... so the NettyHashTimerTimeoutManager.java in line 54 creates a Timeout exception ... this is not a PlcException or PlcRuntimeException, so it falls outside all catch blocks ... I think changing this to a PlcTimeoutException that extends PlcRuntimeException should also fix many of these issues.
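The effect of that change can be sketched with stand-in classes. Everything below is hypothetical: `SketchPlcRuntimeException` and `SketchPlcTimeoutException` only mimic the hierarchy described in the comment, `shouldInvalidate` is an assumed simplification of the check the leased connection performs, and the sketch assumes the timeout in question is `java.util.concurrent.TimeoutException`.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical stand-ins so the sketch compiles without PLC4X on the classpath.
class SketchPlcRuntimeException extends RuntimeException {
    SketchPlcRuntimeException(String message) {
        super(message);
    }
}

// Mirrors the proposal: a timeout type that is part of the runtime-exception hierarchy.
class SketchPlcTimeoutException extends SketchPlcRuntimeException {
    SketchPlcTimeoutException(String message) {
        super(message);
    }
}

class ExceptionFilterSketch {

    // Assumed shape of the check: only errors from the PLC runtime-exception
    // hierarchy cause the leased connection to be invalidated.
    static boolean shouldInvalidate(Throwable error) {
        return error instanceof SketchPlcRuntimeException;
    }

    public static void main(String[] args) {
        // A plain JDK timeout is outside the hierarchy, so it is ignored.
        System.out.println(shouldInvalidate(new TimeoutException("request timed out")));
        // A timeout inside the hierarchy is picked up by the same check.
        System.out.println(shouldInvalidate(new SketchPlcTimeoutException("request timed out")));
    }
}
```

This prints `false` and then `true`, which is the difference the comment is pointing at: the filter stays the same and only the exception type changes.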
Could you folks please try this again and give me feedback on whether this issue is now fixed?
So ... is this fixed or was the thumbs up just an acknowledgement that you'll be doing that?
Acknowledgement that I will test this next week |
@chrisdutz I have tested it in the NiFi integration with an S7 and an OPC UA. Same as last time:
I just added two methods that allow manual removal of connections from the cache ... not tested at all ... feel free to try it out. There are now two methods:
@chrisdutz Already implemented in our fork. Working pretty well. Will be posting a PR soon with the changes needed for the NiFi integration to work properly again.
well ... we're planning on cutting the RC for 0.10.0 on monday ... |
What happened?
Summary
When a connection stored in the connection-cache breaks due to a network failure, the connection is not removed from the cache and blocks future uses of the same connection string.
Context
Encountered while trying to solve a similar problem as #623 in the NiFi integration:
When a processor is running and the network connection to the PLC is interrupted, the processor continues to throw errors even if the network connection is restored.
This was brought up in a mail by me (https://lists.apache.org/thread/xm38nh8xzh1m1kj0y74dx0goo81cos82) that sparked a pull request by heyoulin (#818), an issue by splatch (#821) and a commit from @chrisdutz (9b06c2d).
The commit (9b06c2d) did not fully address the problem, so I bring my attempt to fix it.
Replicate the problem
In order to replicate the problem use the code at the end and follow the steps:
Possible Solution
The LeasedConnection returns a Future that encapsulates the Future that connects to the PLC. The second one is the one that can mark the connection as invalid for removal. For the moment I have been able to work around this by overriding the get method of the first Future:
You can see my solution in the zylklab fork (https://github.com/zylklab/plc4x/tree/Fix/nifi-integration-timeout). If you could give me some feedback I would like to make this into a PR as soon as possible.
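The snippet from the original report is not reproduced here, and the following is not the code from the fork. It is a rough JDK-only sketch of the general idea of overriding `get`, under the assumption that the wrapper has some way to invalidate the cached connection. `InvalidatingFuture` and its `invalidate` callback are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical wrapper future: mirrors the inner future's result and runs
// an invalidation callback when the caller's timed get() expires.
class InvalidatingFuture<T> extends CompletableFuture<T> {
    private final Runnable invalidate;

    InvalidatingFuture(CompletableFuture<T> inner, Runnable invalidate) {
        this.invalidate = invalidate;
        // Forward the inner outcome to this wrapper.
        inner.whenComplete((value, error) -> {
            if (error != null) {
                completeExceptionally(error);
            } else {
                complete(value);
            }
        });
    }

    @Override
    public T get(long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        try {
            return super.get(timeout, unit);
        } catch (TimeoutException e) {
            // The client gave up waiting: treat the connection as broken.
            invalidate.run();
            throw e;
        }
    }
}

class GetOverrideSketch {

    /** Returns true if the callback ran when the timed get() expired. */
    static boolean invalidatedOnTimeout() {
        AtomicBoolean invalidated = new AtomicBoolean(false);
        // Stand-in for a request on a broken connection: it never completes.
        CompletableFuture<String> inner = new CompletableFuture<>();
        InvalidatingFuture<String> leased =
                new InvalidatingFuture<>(inner, () -> invalidated.set(true));
        try {
            leased.get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Expected: the inner future never completes.
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return invalidated.get();
    }

    public static void main(String[] args) {
        System.out.println("invalidated = " + invalidatedOnTimeout());
    }
}
```

A limitation of this approach is that it only covers callers that use the timed `get`. Code that attaches callbacks or uses `orTimeout` on the wrapper would bypass the override, which may be one reason the thread also discusses converting the timeout into a PLC exception inside the driver.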
Version
v0.11.0-SNAPSHOT
Programming Languages
Protocols