Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Device discovery not reliable?! #36

Closed
tuxedo0801 opened this issue Sep 26, 2016 · 15 comments
Closed

Device discovery not reliable?! #36

tuxedo0801 opened this issue Sep 26, 2016 · 15 comments
Labels

Comments

@tuxedo0801
Copy link
Contributor

Hi,

I'm using 2.3-beta and face the following problem when trying to discover KNX interface devices:

The result is not deterministic: Sometimes I find my routing and tunneling device, sometimes not. I then just need to repeat the procedure a few times, and device is again detected...

here's the related code-snippet I'm using:

List<KnxInterfaceDevice> foundDevices = new ArrayList<>(); // List of detected devices. KnxInterface Device is a simple container object that holds the relevant information for my application to setup the correct KNX connection
try {

            Enumeration<NetworkInterface> networkInterfaces = NetworkInterface.getNetworkInterfaces();

            // addresses to scan
            int count = 0;

            if (progress != null) {
                // gathering count data
                while (networkInterfaces.hasMoreElements()) {
                    NetworkInterface ni = networkInterfaces.nextElement();

                    if (ni.isLoopback() || !ni.isUp()) {
                        continue;
                    }

                    for (InterfaceAddress iaddr : ni.getInterfaceAddresses()) {
                        if (iaddr.getAddress() instanceof Inet6Address) {
                            continue;
                        }
                        count++;
                    }
                }

                networkInterfaces = NetworkInterface.getNetworkInterfaces();
            }

            int i = 0;
            while (networkInterfaces.hasMoreElements()) {
                NetworkInterface ni = networkInterfaces.nextElement();

                if (ni.isLoopback() || !ni.isUp()) {
                    continue;
                }
                log.debug("Found network interface: " + ni.getName() + "/" + ni.getDisplayName() + "/" + ni.getInterfaceAddresses());
                try {

                    for (InterfaceAddress iaddr : ni.getInterfaceAddresses()) {
                        if (iaddr.getAddress() instanceof Inet6Address) {
                            continue;
                        }
                        log.debug("Discovering on " + ni.getName() + "@" + iaddr.getAddress());
                        Discoverer discoverer = new Discoverer(iaddr.getAddress(), 0, false, false);
                        if (progress != null) {
                            i++;
                            progress.onProgress(i, count, ni, iaddr.getAddress());
                        }
                        discoverer.startSearch(ni, timeout, true);
                        SearchResponse[] result = discoverer.getSearchResponses();

                        for (SearchResponse sr : result) {

                            ServiceFamiliesDIB serviceFamilies = sr.getServiceFamilies();
                            int[] familyIds = serviceFamilies.getFamilyIds();

                            for (int n = 0; n < familyIds.length; n++) {

                                int id = familyIds[n];
                                KnxInterfaceDevice foundDevice = null;
                                switch (id) {
                                    case ServiceFamiliesDIB.ROUTING:
                                        foundDevice = new KnxRoutingDevice(ni, sr);
                                        break;
                                    case ServiceFamiliesDIB.TUNNELING:
                                        foundDevice = new KnxTunnelingDevice(ni, sr);
                                        break;
                                    default:
                                        log.warn("Unsupported device family: {}", serviceFamilies.getFamilyName(id));
                                }
                                if (foundDevice != null) {
                                    log.debug("Found: {}", foundDevice);
                                    foundDevices.add(foundDevice);
                                }

                            }

                        }
                        discoverer.clearSearchResponses();
                    }
                } catch (Exception ex) {
                    ex.printStackTrace();
                }

            }

        } catch (SocketException ex) {
            ex.printStackTrace();
        }

Some words on the implementation:

When working with multiple network interfaces on a single machine, I have to know on which network interface the KNX IP Router is running. So I first get all related network interfaces and then run a discovery on each of them. It makes no difference if I use a timeout of 1sec, 5sec or 15sec.

I personally faced this issue in Debian Testing AMD64, but some other people faced this on Win7 as well.
I have no firewall running. The PC is connected via LAN to a switch, where also the MDT IP Router is connected to (no WiFi or Fritzbox inbetween).

Any ideas what's wrong?!

@bmalinowsky
Copy link
Collaborator

Hi, for you setup the set timeout does not matter. 2 seconds or so should be plenty time.

Do you know whether there actually is a response received, and whether it was discarded?

@bertvermeiren
Copy link

Hi,

I once played with the library and even attempted to write my own java
implementation.
From what I see in the github code base there is a possibility to not see
the reply when doing discovering.

I have a lot of experience in the java world as a professional with multi
threaded behaviour.

Within the Discoverer class you see following code when sending out the
multicast packet for discovery :

s.send(new DatagramPacket(buf, buf.length, SYSTEM_SETUP_MULTICAST,
SEARCH_PORT));
synchronized (receivers) {
final ReceiverLoop l = startReceiver(s, timeout,
nifName + localAddr.getHostAddress());
receivers.add(l);
return l;
}

The time between sending out and starting up your listener thread reading
from the socket, the device can already have sent back the reply and your
reader thread is not started yet.
You should first start your reader thread before sending out the multicast.

Just my experience.

Greetings

Bert Vermeiren.

2016-09-28 11:24 GMT+02:00 bmalinowsky notifications@github.com:

Hi, for you setup the set timeout does not matter. 2 seconds or so should
be plenty time.

Do you know whether there actually is a response received, and whether it
was discarded?


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#36 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALekjev6klfHO826qAdPUST88CSklRXdks5qujI_gaJpZM4KG15T
.

@tuxedo0801
Copy link
Contributor Author

Thanks Bert,
that indeed looks like a timing/threading-issue... One should definitely start the receiver before sending, otherwise response might receive before the receiver listens.

@tuxedo0801
Copy link
Contributor Author

I had a deeper look at the code and found more than one section where the receiver listens AFTER the datagram has been sent:

  1. https://github.com/calimero-project/calimero-core/blob/master/src/tuwien/auto/calimero/knxnetip/Discoverer.java#L454

  2. https://github.com/calimero-project/calimero-core/blob/master/src/tuwien/auto/calimero/knxnetip/Discoverer.java#L517

Beside that, there are a few optimization options:

a) https://github.com/calimero-project/calimero-core/blob/master/src/tuwien/auto/calimero/knxnetip/Discoverer.java#L518
You don't need to synchronize on a synchronized list when adding new items to the collection

b) https://github.com/calimero-project/calimero-core/blob/master/src/tuwien/auto/calimero/knxnetip/Discoverer.java#L385
If I not missed something very important, this section is unnecessary complex:
What sense does it make to convert the list to an array, iterate over the complete array to call quit() on every list-entry and finally remove the complete array from the list? Why not just doe an for-each iteration, call quit() on every line and finally call clear() on the list? Has this to do with the fact that you're not yet using generics?

For 2) the fix is very easy (already changed w.r.t. a) ):

// start receiver BEFORE sending out the datagram
final ReceiverLoop l = startReceiver(s, timeout, nifName + localAddr.getHostAddress());
receivers.add(l);
s.send(new DatagramPacket(buf, buf.length, SYSTEM_SETUP_MULTICAST, SEARCH_PORT));
return l;

I'll try to create a diff/patch and see if my changes will fix the issue.

@tuxedo0801
Copy link
Contributor Author

tuxedo0801 commented Sep 29, 2016

I modified the already mentioned section 1) and 2).
Works great. No more detection problems faced so far.

Here's the patch:

diff --git a/src/tuwien/auto/calimero/knxnetip/Discoverer.java b/src/tuwien/auto/calimero/knxnetip/Discoverer.java
index 0b39734..29a3dba 100644
--- a/src/tuwien/auto/calimero/knxnetip/Discoverer.java
+++ b/src/tuwien/auto/calimero/knxnetip/Discoverer.java
@@ -451,14 +451,21 @@
        try {
            final byte[] buf = PacketHelper.toPacket(new DescriptionRequest(nat ? null
                    : (InetSocketAddress) s.getLocalSocketAddress()));
+                        
+                        // start receiver thread
+                        ReceiverLoop receiver = startDescriptionReceiver(s, timeout * 1000, server, s.toString());
            s.send(new DatagramPacket(buf, buf.length, server));
-           final ReceiverLoop looper = new ReceiverLoop(s, 256, timeout * 1000,
-               server);
-           looper.loop();
-           if (looper.thrown != null)
-               throw looper.thrown;
-           if (looper.res != null)
-               return looper.res;
+                        try {
+                            // block until receiver finishs
+                            join(receiver);
+                        } catch (InterruptedException ex) {
+                            // forward the exception to outer try/catch
+                            throw new IOException(ex);
+                        }
+           if (receiver.thrown != null)
+               throw receiver.thrown;
+           if (receiver.res != null)
+               return receiver.res;
        }
        catch (final IOException e) {
            final String msg = "network failure on getting description";
@@ -513,14 +520,12 @@
            final InetSocketAddress res = mcast ? new InetSocketAddress(SYSTEM_SETUP_MULTICAST,
                    s.getLocalPort()) : nat ? null : new InetSocketAddress(localAddr,
                    s.getLocalPort());
-           final byte[] buf = PacketHelper.toPacket(new SearchRequest(res));
-           s.send(new DatagramPacket(buf, buf.length, SYSTEM_SETUP_MULTICAST, SEARCH_PORT));
-           synchronized (receivers) {
-               final ReceiverLoop l = startReceiver(s, timeout,
-                       nifName + localAddr.getHostAddress());
-               receivers.add(l);
-               return l;
-           }
+                        final byte[] buf = PacketHelper.toPacket(new SearchRequest(res));
+                        // start receiver BEFORE sending out the datagram
+                        final ReceiverLoop l = startReceiver(s, timeout, nifName + localAddr.getHostAddress());
+                        receivers.add(l);
+                        s.send(new DatagramPacket(buf, buf.length, SYSTEM_SETUP_MULTICAST, SEARCH_PORT));
+                        return l;
        }
        catch (final IOException e) {
            if (mcast)
@@ -618,6 +623,16 @@
        looper.t.start();
        return looper;
    }
+        
+        private ReceiverLoop startDescriptionReceiver(final DatagramSocket socket, final int timeout, final InetSocketAddress queriedServer,
+       final String name)
+   {
+       final ReceiverLoop looper = new ReceiverLoop(socket, 256, timeout * 1000, queriedServer);
+       looper.t = new Thread(looper, "Discoverer " + name);
+       looper.t.setDaemon(true);
+       looper.t.start();
+       return looper;
+   }

    private final class ReceiverLoop extends UdpSocketLooper implements Runnable
    {

Would be great if you @bmalinowsky could check this and create a 2.3-beta2 or so, to have this fix available before the 2.4 release.

@bertvermeiren
Copy link

for b) This code seems correct. For what I can think of - didn't read the
full class although - a new reader can be added (via startSearch) multi
threaded while calling this code.
So you don't want to remove that one from the general list.

If I should design this class, I would make sure you can only use this
discoverer instance once and not multiple times. Makes synchronization much
simpler. (with a close() method for instance)

Regards, Bert.

2016-09-29 8:24 GMT+02:00 Alex notifications@github.com:

I had a deeper look at the code and found more than one section where the
receiver listens AFTER the datagram has been sent:

  1. https://github.com/calimero-project/calimero-core/blob/
    master/src/tuwien/auto/calimero/knxnetip/Discoverer.java#L454

  2. https://github.com/calimero-project/calimero-core/blob/
    master/src/tuwien/auto/calimero/knxnetip/Discoverer.java#L517

Beside that, there are a few optimization options:

a) https://github.com/calimero-project/calimero-core/blob/
master/src/tuwien/auto/calimero/knxnetip/Discoverer.java#L518
You don't need to synchronize on a synchronized list when adding new items
to the collection

b) https://github.com/calimero-project/calimero-core/blob/
master/src/tuwien/auto/calimero/knxnetip/Discoverer.java#L385
If I not missed something very important, this section is unnecessary
complex:
What sense does it make to convert the list to an array, iterate over the
complete array to call quit() on every list-entry and finally remove the
complete array from the list? Why not just doe an for-each iteration, call
quit() on every line and finally call clear() on the list? Has this to do
with the fact that you're not yet using generics?

For 2) the fix is very easy (already changed w.r.t. a) ):

// start receiver BEFORE sending out the datagramfinal ReceiverLoop l = startReceiver(s, timeout, nifName + localAddr.getHostAddress());
receivers.add(l);
s.send(new DatagramPacket(buf, buf.length, SYSTEM_SETUP_MULTICAST, SEARCH_PORT));return l;

I'll try to create a diff/patch and see if my changes will fix the issue.


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#36 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALekjQsxFCjA9PjbiUPNxAZfS5dPiJ92ks5qu1m3gaJpZM4KG15T
.

@tuxedo0801
Copy link
Contributor Author

@bmalinowsky
Any comments on the patch? Works like a charm on my side. But I would rather go for an official release instead of an own, patched version ...

@bmalinowsky
Copy link
Collaborator

I didn't look into your problem in detail due to current time limitations. Therefore, I also can't exactly state a schedule for v2.3 (which would contain a resolution to this and #37, if applicable).
Since the patch works for you, use it :)

In general (as said, without further insight), I have questions about the underlying hypothesis of Bert, being an issue wrt the order in the execution sequence. I have to take a closer look ...

@bertvermeiren
Copy link

Trust me, the network can reply much faster as your java code can run and
starts listening for the response after you send out a UDP datagram packet..
And I'm even not talking here about java garbage collection in between
those 2 operations which definitely can miss the reply.

Regards, Bert.

2016-10-11 14:52 GMT+02:00 bmalinowsky notifications@github.com:

I didn't look into your problem in detail due to current time limitations.
Therefore, I also can't exactly state a schedule for v2.3 (which would
contain a resolution to this and #37
#37, if
applicable).
Since the patch works for you, use it :)

In general (as said, without further insight), I have questions about the
underlying hypothesis of Bert, being an issue wrt the order in the
execution sequence. I have to take a closer look ...


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#36 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ALekjV2pOP0xZBEP5oUJE8a7poE3Ze9_ks5qy4aUgaJpZM4KG15T
.

@bmalinowsky
Copy link
Collaborator

@bertvermeiren

Trust me, the network can reply much faster as your java code can run and
starts listening for the response after you send out a UDP datagram packet..

I even expect it to be faster on certain devices.

By saying there is a race condition, you imply a happens-before requirement on receive -> send (and wrt that, the patch of tuxedo is actually not correct).
One could not write a single-threaded send -> receive; also not run a simple receive loop, because the next call to receive would have to happen before the current one returns (both ofc are possible to do correctly).

Sockets use rx/tx buffers. You can wait a day before invoking receive, the response will be there (usual exceptions apply).

@tuxedo0801
Copy link
Contributor Author

I still fully agree with Bert. If you use UDP, you should start listenen before you send a request. Otherwise you have a receive-gap where you can miss datagram packets.

The receive method block until a packet is received. It does AFAIK not return already received packets.

wrt my patch: Could you please explain what "is not correct" means?
Without my patch: Sometimes the packet is received slow enough so that the receiver can get it, sometimes it's fast enough so that the receiver don't get it.

Sockets use rx/tx buffers.

Of course.

You can wait a day before invoking receive, the response will be there (usual exceptions apply).

That's true for TCP streams. But with UDP you don't have a stream. You don't even have a "connection".

@bmalinowsky
Copy link
Collaborator

Otherwise you have a receive-gap where you can miss datagram packets.

You can always miss datagrams. UDP is an unreliable protocol after all.

It does AFAIK not return already received packets.

It only returns already received "packets".

Could you please explain what "is not correct" means?

Actual thread execution, and invocation of receive (note that my answer was to Bert). I won't go into another discussion here.

That's true for TCP streams. But with UDP you don't have a stream. You don't even have a "connection".

What does TCP and its definition of a connection have to do with that?

@tuxedo0801
Copy link
Contributor Author

I have to admin: You're right.

I created a short demo, sending udp datagram from A to B with a giant pause and a minimum rx buffer size of 1.. And.... datagram receives.
So maybe I mixed up something...

I'm currently sitting at a different PC, so I cannot re-test with the same environment.
With this PC, I'm not really able to reproduce the issue (2.3-beta, without my patch).

With this PC I had the situation where too many udp packets where received in tcpdump (filter was not explicit enough), so that I was not able to capture a screenshot/copy and paste a situation, where the MDT IP Router answered >5sek after the search request. My discoverer-setup is configured to 5sek.
But I only had this situation once and finally have no proof :-(

Maybe the whole issue depends on a stressed KNX interface?!

I will do some more tests with my other machine where the problem was visible and report back the findings.

@bmalinowsky
Copy link
Collaborator

Re-open if there are any conclusive findings...

@tuxedo0801
Copy link
Contributor Author

tuxedo0801 commented Dec 15, 2016

Issue still present, but no additional details available.

The provided "fix" in fact solves the issue, reproducible on windows and linux.

Beside participating on this discussion: Did you @bmalinowsky try to run the posted code snippet and thus try to reproduce?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants