
[Bug]: plc4j-tools-connection-cache: broken connections remaining in the cache on timeout #900

Closed
1 of 16 tasks
QuanticPony opened this issue Apr 14, 2023 · 25 comments

@QuanticPony
Contributor

What happened?

Summary

When a connection stored in the connection-cache breaks due to a network failure, the connection is not removed from the cache and blocks future uses of the same connection string.

Context

Encountered while trying to solve a problem similar to #623 in the NiFi integration:
When a processor is running and the network connection to the PLC is interrupted, the processor continues to throw errors even after the network connection is restored.

This was brought up in a mail by me (https://lists.apache.org/thread/xm38nh8xzh1m1kj0y74dx0goo81cos82) that sparked a pull request by heyoulin (#818), an issue by splatch (#821) and a commit from @chrisdutz (9b06c2d).

The commit (9b06c2d) did not fully address the problem, so here is my attempt to fix it.

Replicate the problem

To replicate the problem, use the code at the end of this issue and follow these steps:

  1. Start the main below
  2. Disconnect network
  3. Wait until errors are shown in the stdout
  4. You will see the connection is still being used after it fails:
16:38:22.486 [main] DEBUG o.a.p.j.u.c.CachedPlcConnectionManager.getConnection:72 - Reusing exising connection
Failed to read due to: 
java.util.concurrent.TimeoutException
  5. Reconnect network. The problem persists.

Possible Solution

The LeasedConnection returns a Future that wraps the Future which actually connects to the PLC. Only that inner Future can mark the connection as invalid for removal. For the moment I have been able to work around this by overriding the get method of the outer Future:

@Override
public PlcReadResponse get(long timeout, TimeUnit unit)
        throws InterruptedException, ExecutionException, TimeoutException {
    try {
        return super.get(timeout, unit);
    } catch (TimeoutException e) {
        // Propagate the client-side timeout into the wrapped future of the real
        // connection, so the cached connection can be marked as invalid.
        future.completeExceptionally(e);
        throw e;
    }
}
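
For context, a minimal sketch of the kind of wrapper this override lives in (class and field names here are illustrative, not the actual code in the fork): the outer future handed to the client keeps a reference to the inner future owned by the real connection, so a client-side timeout can be pushed into it.

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.plc4x.java.api.messages.PlcReadResponse;

// Illustrative wrapper: "future" is the inner CompletableFuture owned by the real
// connection; the outer future is what the LeasedConnection hands to the client.
class TimeoutPropagatingFuture extends CompletableFuture<PlcReadResponse> {

    private final CompletableFuture<PlcReadResponse> future;

    TimeoutPropagatingFuture(CompletableFuture<PlcReadResponse> future) {
        this.future = future;
        // Mirror the inner future's outcome on the outer one.
        future.whenComplete((response, error) -> {
            if (error != null) {
                completeExceptionally(error);
            } else {
                complete(response);
            }
        });
    }

    @Override
    public PlcReadResponse get(long timeout, TimeUnit unit)
            throws InterruptedException, ExecutionException, TimeoutException {
        try {
            return super.get(timeout, unit);
        } catch (TimeoutException e) {
            // Push the client-side timeout into the inner future so the cached
            // connection gets marked as invalid.
            future.completeExceptionally(e);
            throw e;
        }
    }
}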

You can see my solution in the zylklab fork (https://github.com/zylklab/plc4x/tree/Fix/nifi-integration-timeout). If you could give me some feedback, I would like to turn this into a PR as soon as possible.


import java.time.Duration;
import java.util.concurrent.TimeUnit;

import org.apache.plc4x.java.DefaultPlcDriverManager;
import org.apache.plc4x.java.api.PlcConnection;
import org.apache.plc4x.java.api.messages.PlcReadRequest;
import org.apache.plc4x.java.api.messages.PlcReadResponse;
import org.apache.plc4x.java.utils.cache.CachedPlcConnectionManager;

public class ManualTest {

    public static void main(String[] args) throws InterruptedException {
        CachedPlcConnectionManager cachedPlcConnectionManager = CachedPlcConnectionManager.getBuilder(new DefaultPlcDriverManager())
            .withMaxLeaseTime(Duration.ofMinutes(5))
            .build();
        for (int i = 0; i < 100; i++) {
            Thread.sleep(1000);
            try (PlcConnection connection = cachedPlcConnectionManager.getConnection("s7://10.105.143.7:102?remote-rack=0&remote-slot=1&controller-type=S7_1200")) {
                PlcReadRequest.Builder plcReadRequestBuilder = connection.readRequestBuilder();
                plcReadRequestBuilder.addTagAddress("foo", "%DB1:DBX0.0:BOOL");
                PlcReadRequest plcReadRequest = plcReadRequestBuilder.build();

                PlcReadResponse plcReadResponse = plcReadRequest.execute().get(1000, TimeUnit.MILLISECONDS);
                System.out.printf("Run %d: Value: %f%n", i, plcReadResponse.getFloat("foo"));
            } catch (Exception e) {
                System.out.println("Failed to read due to: ");
                e.printStackTrace();
            }
        }
    }
}

Version

v0.11.0-SNAPSHOT

Programming Languages

  • plc4j
  • plc4go
  • plc4c
  • plc4net

Protocols

  • AB-Ethernet
  • ADS /AMS
  • BACnet/IP
  • CANopen
  • DeltaV
  • DF1
  • EtherNet/IP
  • Firmata
  • KNXnet/IP
  • Modbus
  • OPC-UA
  • S7
@spnettec

spnettec commented Jul 3, 2023

Can you try my repository? If it works, I can push the fix for the PlcConnection bug.

@sruehl
Contributor

sruehl commented Jul 3, 2023

@chrisdutz can you look into this?

@chrisdutz
Contributor

Sorry for the late response. I was able to reproduce the problem.

@chrisdutz
Contributor

chrisdutz commented Jul 3, 2023

Interestingly it did recover after quite some time ... will look into this.

Run 0: Value: 1,000000
Run 1: Value: 1,000000
Run 2: Value: 1,000000
Run 3: Value: 1,000000
Run 4: Value: 1,000000
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Failed to read due to:
java.util.concurrent.TimeoutException
at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
at org.apache.plc4x.java.utils.cache.ManualTest.main(ManualTest.java:39)
Run 12: Value: 1,000000
Run 13: Value: 1,000000
Run 14: Value: 1,000000
Run 15: Value: 1,000000
Run 16: Value: 1,000000
Run 17: Value: 1,000000
Run 18: Value: 1,000000
Run 19: Value: 1,000000

@chrisdutz
Contributor

So I guess the problem is that we have a connection timeout that is longer than your 1 second timeout ... so it looks as if the connection fails and the next connect runs into the void ... we then wait for this to time out and then create a new attempt ... if the connection is available again, it connects and the application recovers; if it's not available, we wait for the next connection timeout and then try again.

@chrisdutz
Contributor

Ok ... keeping the connection interrupted for longer than the connection timeout resulted in a different set of errors. Unfortunately I didn't quite understand your solution, or couldn't find it in the branch.

@spnettec

spnettec commented Jul 3, 2023

This is not a connection-cache bug. It is a bug in the base Plc4xNettyWrapper that has existed for a long time. I fixed it in #818, but you did not accept my fix. You can check #801; it partly resolves this huge bug.

@QuanticPony
Contributor Author

QuanticPony commented Jul 4, 2023

@spnettec I tried your branch a long time ago and it seemed to work. I only used your changes to the connection-cache in that testing.

@chrisdutz When I created this issue I didn't get a single re-connection. I have pulled from develop and now I also get re-connections. Will try the NiFi integration that was not working. Will let you know if it works properly!

Ok ... keeping the connection interrupted for longer than the connection timeout resulted in a different set of errors. Unfortunately I didn't quite understand your solution, or couldn't find it in the branch.

If you are looking for my changes you can view them here.

@chrisdutz
Contributor

@spnettec I think I didn't accept your PR as I think it addressed multiple things and changed the API of PLC4X in an undesirable way. That was why I tried implementing an alternate path.

@QuanticPony thanks for the pointer ... that link helps a lot. Will have a look at it after work.

@chrisdutz
Contributor

Ok ... so if I have a look at the changes it seems your branch has diverged quite a bit from develop ... I'd like to help work on fixing any issues you might be having here.

@QuanticPony
Contributor Author

Hi again. I had a look into the NiFi integration after updating the branch:
Found the S7 driver reconnecting, but the OPC UA driver didn't.
Tried the manual test that I posted in this issue with both drivers. Same result.
Additionally I found that the S7 driver always reconnected after 15 disconnect messages.
From this I take it that it depends on how the driver detects a disconnection.

Not sure how we should handle reconnection from the NiFi integration side. Right now, using it to connect to an OPC UA PLC means that after a disconnection a user has to manually restart the processors or restart NiFi.

Tried again with the changes I made. It reconnects in both cases.

@chrisdutz
Contributor

I think my main problem is how I can simulate the situation in a unit test to prove the cache works correctly.
Could you please describe the situation exactly and what you're seeing the system do? I'd try to replicate this; however, I couldn't replicate the driver not coming back to life.

@QuanticPony
Contributor Author

@chrisdutz I have made a unit test for the behavior that I think is correct. You can check it here

The problem that I see is that any exception while (for example) reading from the PLC causes the LeasedPlcConnection to be invalidated, but a timeout due to a broken connection does not.

@chrisdutz
Contributor

Thanks for that ... of course I have to finish my changes on the subscription API first, or nothing will compile ;-)

@chrisdutz
Contributor

Ok ... so the first test passes in my setup ... however, for the second one I'm not really sure how it should fail ... the timeout is on the request execution ... so the client says "give me that in 50 ms", but this timeout only exists in the CompletableFuture and the driver has no way to know how long the client is willing to wait.

I guess I should think of a way for the operation to time out internally (without waiting 10 seconds or so).

Or how does the driver know about your completable future timeout?
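
To illustrate that point outside of PLC4X: in plain Java, a timeout passed to CompletableFuture.get(timeout, unit) only bounds how long the caller waits; the future producing the result is not cancelled and never learns about the deadline. A minimal, self-contained demo:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ClientSideTimeoutDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for a driver request that takes 2 seconds to complete.
        CompletableFuture<String> response = CompletableFuture.supplyAsync(() -> {
            try {
                TimeUnit.SECONDS.sleep(2);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "value";
        });

        try {
            // The client is only willing to wait 50 ms ...
            response.get(50, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // ... but the producing side never sees that deadline: the future is
            // not cancelled and still completes normally later on.
            System.out.println("cancelled after client timeout? " + response.isCancelled());
            System.out.println("eventually completed with: " + response.get());
        }
    }
}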

@QuanticPony
Contributor Author

The connection cache gives a leased connection with the real connection inside. If an exception occurs in the real connection the leased connection is invalidated. This is the first test.

I think the problem is that when the client gets a timeout on the leased connection, the CompletableFuture of the real connection, the one that invalidates the leased connection, does not time out. This is what breaks the cache in my case, as that connection is no longer usable. I think a timeout on the leased connection should propagate into the real connection, making the real connection invalid. Otherwise the client has no way of removing the real connection from the connection cache.
This is the second test.

I see two solutions: either propagate the timeout, or allow the client to manually invalidate a connection in the cache.
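
A hypothetical sketch of what the second option could look like from the client side, continuing from the ManualTest above (the invalidateConnection method and the connectionString variable are made up purely for illustration; nothing like them exists in the cache API at this point):

try {
    PlcReadResponse plcReadResponse = plcReadRequest.execute().get(1000, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
    // Hypothetical API: explicitly tell the cache that the connection behind this
    // connection string can no longer be trusted, so the next getConnection()
    // call builds a fresh connection instead of reusing the broken one.
    cachedPlcConnectionManager.invalidateConnection(connectionString);
    throw e;
}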

@chrisdutz
Contributor

Well, I did locally change the LeasedConnection to simply react on Exception instead of PlcRuntimeException ... this should now catch timeout errors too ... but I'm looking into whether we shouldn't be catching the timeouts and converting them to Plc-exceptions.
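
A rough sketch of what that change amounts to, assuming the leased connection funnels request execution through a try/catch (class and method names here are illustrative, not the real LeasedPlcConnection internals):

import java.util.concurrent.Callable;

import org.apache.plc4x.java.api.exceptions.PlcRuntimeException;

// Sketch: widening the catch from PlcRuntimeException to Exception means that a
// TimeoutException coming out of the driver also invalidates the cached connection.
final class RequestWrapperSketch {

    <T> T executeOnLeasedConnection(Callable<T> request, Runnable invalidateConnection) {
        try {
            return request.call();
        } catch (Exception e) { // previously: catch (PlcRuntimeException e)
            invalidateConnection.run();
            throw new PlcRuntimeException("Error executing request on cached connection", e);
        }
    }
}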

@chrisdutz
Contributor

Think I found it ... so NettyHashTimerTimeoutManager.java in line 54 creates a TimeoutException ... this is not a PlcException or PlcRuntimeException, so it falls outside all catch blocks ... I think changing this to a PlcTimeoutException that extends PlcRuntimeException should also fix many of these issues.
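
The exception itself would be trivial; a sketch, assuming it simply extends PlcRuntimeException as described (whether PLC4X already ships a class with this exact name isn't settled here):

import org.apache.plc4x.java.api.exceptions.PlcRuntimeException;

// Sketch: a timeout exception that extends PlcRuntimeException, so existing catch
// blocks for Plc runtime errors (including the connection cache's) pick it up.
public class PlcTimeoutException extends PlcRuntimeException {

    public PlcTimeoutException(long timeoutMillis) {
        super("Timeout of " + timeoutMillis + "ms reached");
    }

    public PlcTimeoutException(String message, Throwable cause) {
        super(message, cause);
    }
}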

@chrisdutz
Contributor

Could you folks please try this again and give me feedback on whether this issue is now fixed?

@chrisdutz
Contributor

So ... is this fixed, or was the thumbs-up just an acknowledgement that you'll be doing that?

@QuanticPony
Contributor Author

Acknowledgement that I will test this next week

@QuanticPony
Contributor Author

@chrisdutz I have tested it in the NiFi integration with an S7 and an OPC UA PLC. Same as last time:
The S7 driver detected the invalid connection and successfully reconnected. The OPC UA driver did not.
I am pretty sure this is a driver-related problem at this point.
Nevertheless I would suggest adding a way of removing LeasedPlcConnections from the cache manually.
I will close this issue. Thank you for your help!

@chrisdutz
Contributor

I just added two methods that allow manual removal of connections from the cache ... not tested at all ... feel free to try them out:

  • getCachedConnections, which returns the set of cached connection URLs
  • removeCachedConnection(String), which removes the connection with the given URL from the cache (see the usage sketch below)
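
A short usage sketch, continuing from the ManualTest above (method names taken from the list; exact signatures untested, as noted):

// After a client-side timeout, drop the cached connection for this URL by hand,
// so the next getConnection() call has to build a fresh connection.
String url = "s7://10.105.143.7:102?remote-rack=0&remote-slot=1&controller-type=S7_1200";
if (cachedPlcConnectionManager.getCachedConnections().contains(url)) {
    cachedPlcConnectionManager.removeCachedConnection(url);
}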

@QuanticPony
Contributor Author

@chrisdutz Already implemented in our fork and working pretty well. Will be posting a PR soon with the changes needed for the NiFi integration to work properly again.

@chrisdutz
Contributor

Well ... we're planning on cutting the RC for 0.10.0 on Monday ...
