
UDP fragmentation #255

Closed

softins opened this issue May 21, 2020 · 47 comments

Comments

@softins
Member

softins commented May 21, 2020

https://github.com/corrados/jamulus/blob/017796919c814a74adcf916a6f200e7c157e58f1/ChangeLog#L19

Going down from 200 to 150 won't remove UDP fragmentation. In a Wireshark trace a couple of weeks ago, a list of 198 servers took 8 fragments, which when reassembled had a payload size of 6780 bytes. The fragments in that instance appeared to have an MTU of only 778 bytes, which is something over which neither end has control, as it may be due to an intermediate hop. Even if the path had the maximum possible MTU of 1500 bytes, it would still take around 4 or 5 fragments to send a list of 200 servers.

The only way really to solve it would be for the server also to listen on a TCP port (could use the same port number). A client wanting the server list could then connect to the central server on TCP, fetch the list and then close the connection. Fragmentation should no longer be a problem then.

For backward compatibility, a client could fall back to UDP if the TCP connection was refused, and the central server would still serve UDP requests for the server list if asked.

(Of course, it is only broken or misconfigured routers or computers that do not correctly handle fragmented UDP packets)
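To make the suggestion above concrete, here is a minimal client-side sketch of "try TCP first, fall back to UDP" (illustrative only: the host name, the reuse of the UDP port number for TCP, and the 4-byte length-prefix framing are assumptions for the example, not part of the existing Jamulus protocol):

```python
import socket
import struct

CENTRAL_HOST = "central.example.org"  # hypothetical central server
CENTRAL_PORT = 22124                  # assume TCP reuses the same port number as UDP

def _recv_exact(sock, n):
    """Read exactly n bytes from a TCP socket (TCP is a byte stream)."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("server closed the connection early")
        buf += chunk
    return buf

def fetch_server_list_tcp(request: bytes) -> bytes:
    """Fetch the full list over TCP, assuming the reply is length-prefixed."""
    with socket.create_connection((CENTRAL_HOST, CENTRAL_PORT), timeout=5.0) as sock:
        sock.sendall(request)
        (length,) = struct.unpack("!I", _recv_exact(sock, 4))
        return _recv_exact(sock, length)

def fetch_server_list(request: bytes) -> bytes:
    try:
        return fetch_server_list_tcp(request)
    except OSError:
        # Connection refused or timed out: fall back to the existing UDP request
        # so that older central servers keep working.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(5.0)
            sock.sendto(request, (CENTRAL_HOST, CENTRAL_PORT))
            data, _addr = sock.recvfrom(65535)  # may have been fragmented in transit
            return data
```

Because TCP is a stream, the kernel handles path MTU discovery and segmentation, so the size of the list no longer matters to the application.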

@corrados
Contributor

We will have multiple server lists in the near future. Each list size can even be further reduced to, e.g., 100 or fewer servers. This will solve the issue.

@corrados
Contributor

Users reported not long ago that the NorthAmerica list worked but not the Default one. At that time the NorthAmerica list already had about 100 entries, so I assume that this will work.

@streaps

streaps commented May 21, 2020

IMHO protocol implementation issues shouldn't dictate the size of the server list and force the server space to be split across multiple central servers. This is really a bad design. Before I read this issue I had already wondered (as a user who has just discovered Jamulus) why there is a limit of 200.

I don't know the code and the protocol, so I don't know how hard or easy it would be to change (compatibility with older clients and servers, ...). There is also QUIC as an alternative to TCP and plain UDP. QUIC would also make it possible to connect with a web client in the future.

@corrados
Contributor

corrados commented May 21, 2020

This is really a bad design.

Well, it worked remarkably well for the last 14 years ;-).

shouldn't dictate the size of the server list

The large server list does not only have UDP fragmentation issues but others as well. E.g. the client pings all the servers in the list every few seconds. This is a fundamental thing. It is much better to have multiple small lists.
Fortunately, pljones volunteers to host two more lists. So we will have four lists with the next Jamulus release (hopefully this weekend) and maybe even more in the future.

@streaps

streaps commented May 21, 2020

This is really a bad design.

Well, it worked remarkably well for the last 14 years ;-).

I wonder why there is an open issue about it, if it still works well. ;)

@corrados
Contributor

Well, it seems you are a bit impatient ;-)

@lefty665

lefty665 commented May 23, 2020

I made this comment a couple of days ago on sourceforge but was redirected here.

Hi Guys, help please. I'm in the US, and getting erratic connect lists. Sometimes it's full lists, sometimes partial lists, and most often no lists. When loading with --showallservers, the servers with ping times (not all have them) look a lot like the partial lists, when they show up at all. Most times it works the first time after a boot. Simply closing Jamulus and reopening it does not re-enable server lists. Once it quits giving lists it's done until a reboot. Older 8-core computer @ 4 GHz, 16 GB RAM, but poky DSL, although far faster than spec minimums. Any suggestions, ports I should look at, etc. are appreciated.

Other than connection issues I'm a fan, Jamulus has let me keep up with band members and musician friends while we're sequestered. Thank you.

P.S. A fellow on SourceForge embraced the idea of turning off the stateful packet inspection portion of a user's firewall and indicated he was going to add it to the Jamulus wiki post. That doesn't seem like a great idea. https://sourceforge.net/p/llcon/discussion/533517/thread/0e9aa52428/?page=1

Thanks for your help.

@gilgongo
Member

gilgongo commented May 23, 2020

@lefty665 The forthcoming update (3.5.4) implements a feature designed to help with the issue of seeing fewer servers than you should. This may or may not cure it, but it would be good if you could test it out. Watch out for the release announcement.

As to the firewall issue, I see they have replied to that, but essentially you will be safe without SPI unless you are doing something pretty non-standard with your network. In which case you would know whether you need SPI or not.

@corrados
Contributor

When loading with --showallservers, the servers with ping times (not all have them) look a lot like the partial lists, when they show up at all.

Do you always see the full list if you use --showallservers (ignoring that not all servers have a ping time)?

@gilgongo
Member

Do you always see the full list if you use --showallservers

That's certainly my own experience. Would be interesting to know about others.

@corrados
Contributor

That's certainly my own experience.

If this is the case, then we are in the wrong GitHub Issue here. Because if you see the full list with --showallservers, you are not affected by the UDP fragmentation issue (see the title of this Issue), since you can receive the complete list. It is then an issue with the ping messages being suppressed.

@gilgongo
Member

Ah OK. I thought the two things were connected.

@lefty665

lefty665 commented May 24, 2020

When loading with --showallservers, the servers with ping times (not all have them) look a lot like the partial lists, when they show up at all.

Do you always see the full list if you use --showallservers (ignoring that not all servers have a ping time)?

No, it will show a full list once, occasionally twice, then a blank data list until it is rebooted. If I had to guess it would more likely be a ping issue. Jamulus apparently generates a lot of pings. Could it be overrunning a buffer or exceeding a threshold? But, that's just guesses from what I can see, not an informed opinion from getting my nose in the code. Thank you for your response, and for Jamulus.

@lefty665

lefty665 commented May 24, 2020

@lefty665 The forthcoming update (3.5.4) implements a feature designed to help with the issue of seeing fewer servers than you should. This may or may not cure it, but it would be good if you could test it out. Watch out for the release announcement.
As to the firewall issue, I see they have replied to that, but essentially you will be safe without SPI unless you are doing something pretty non-standard with your network. In which case you would know whether you need SPI or not.

I'll look for 3.5.4, thank you.

In reference to the firewall issue, I don't see that response; please be kind enough to direct me to it. - I found your response on SourceForge. My response is "horsefeathers". Your explanation of web threats is naïve at best. Having both hardware/router and software/OS firewalls is entry-level web security.

I do see from the wiki that recommending disabling SPI is now included. SPI is a fundamental feature of many better router firewalls. It is concerning that you would recommend disabling it to a user community that is often not well versed in web threats. In addition to potential damage to user computers, your recommendation to trusting users creates potential liability for damages they may incur. I encourage you to remove that recommendation.

Tell me please, is Jamulus "doing something pretty non-standard with your network" traffic that would run afoul of SPI? Thanks. https://en.wikipedia.org/wiki/Stateful_firewall

@gilgongo
Member

SPI is a fundamental feature of many better router firewalls.

That discussion isn't relevant to this ticket. Please continue it on the forum thread.

@corrados
Contributor

3.5.4 is out. I have now reduced the number of servers in the server list to 150. Can you please update this Issue with your observations on whether the UDP fragmentation issue has improved after this update?

@softins
Member Author

softins commented May 24, 2020

I've just taken a tcpdump packet trace on the back-end of my Jamulus Explorer while it fetched the server list for both Default Server and Default Server (North America). The capture file is here. You can display it in Wireshark, and can display the Jamulus protocol if you install my Jamulus dissector plugin for Wireshark.

The main server list of 150 servers took 3 fragments at the full MTU of 1500 bytes, to send a payload of 5099 bytes. The NA list of 113 servers took 2 fragments to send a payload of 4104 bytes.

To avoid fragmentation at the full MTU, you would need to limit the size of the server list to something like 50, which is a bit impractical. However, it isn't possible to guarantee the MTU between a client and server. One of my clients used an MTU of around 780 bytes, which means it needed twice as many fragments.

It was after similar investigations that I suggested the central server could additionally listen for TCP connections, and the client could make a TCP connection just to fetch the server list, and use UDP for everything else. TCP connections can perform path MTU discovery and adjust to match, avoiding the problem of fragmentation.

For more details on the Jamulus Explorer back-end, see https://github.com/softins/jamulus-php, and for the front-end see https://github.com/softins/jamulus-web

@corrados
Contributor

Thanks for your investigation results. Let's wait and see what the Jamulus users report when using the new 3.5.4 version. Maybe the number of fragments is what matters (i.e. if we have too many fragments, it becomes a problem). So if you have fewer fragments it may not be an issue for the router.

@lefty665

corrados, FYI, 3.5.4 did not change the behavior for me. While far from ideal, I have a resolution that works. That is rebooting. Thank you for what is a useful and interesting piece of software.

FWIW, a regular data list of servers does not show a server not far from me that belongs to a friend and which is a prime reason I use Jamulus. When I load Jamulus using --showallservers it is listed with a ping time, so I have learned how to connect. Dunno if that behavior gives you a hint about what is happening.

@corrados
Contributor

If I had to guess it would more likely be a ping issue. Jamulus apparently generates a lot of pings. [...] While far from ideal, I have a resolution that works. That is rebooting.

I have no idea why a reboot helps you. Obviously, a reboot does not change the Jamulus software. Something else seems to be reset when you reboot your PC. Maybe some firewall settings are reset.

@corrados
Contributor

@lefty665 One more question: Since 3.5.4 we now have multiple server lists (genre-based lists). Do you observe the empty list issue for all of them?

@WolfganP

It was after similar investigations that I suggested the central server could additionally listen for TCP connections, and the client could make a TCP connection just to fetch the server list, and use UDP for everything else. TCP connections can perform path MTU discovery and adjust to match, avoiding the problem of fragmentation.

I agree, TCP is a better protocol for receiving information such as the server lists (not time-sensitive, no need to impose a size limit on the content or to manage re-transmissions). An open TCP channel could also be used as a control channel for Jamulus for other tasks in the future beyond the server lists, e.g. triggering recordings remotely from the client.

@corrados
Contributor

Even with TCP your server list will be empty because your pings will be blocked.

@WolfganP

Even with TCP your server list will be empty because your pings will be blocked.

Some ping UDP packets may still be lost, but that could be treated as per-server status (i.e. active/unreachable due to ping error) and updated on the next refresh (i.e. the next ping, an individual UDP packet). The way I see it, at least the client would then start working from a validated and complete server list.

@corrados
Contributor

corrados commented Jun 1, 2020

@lefty665 One more question: Since 3.5.4 we now have multiple server lists (genre-based lists). Do you observe the empty list issue for all of them?

This question is still open.

We have had multiple server lists for about a week now. Can anybody who had issues with empty server lists report here whether the issue is still present with the new Jamulus version? It is important to check all available server lists, i.e., to know whether only one, several or all lists are empty.

@lefty665

lefty665 commented Jun 2, 2020

@lefty665 One more question: Since 3.5.4 we now have multiple server lists (genre-based lists). Do you observe the empty list issue for all of them?

Not for the individual genre lists, but yes for All Genres.

Individual genres will redisplay multiple times, and they appear complete [the count is the same]. On redisplay, All Genres shows an empty list as described below.

All Genres: on the first load I'm getting a more comprehensive list and it is loading much quicker. A second connect request still generates a blank data list. Curiously, if I wait a while [15 minutes to half an hour] it will reset and show a list instead of requiring a reboot, although I suspect the list is not as complete as the initial load.

While my issue is not completely resolved, it is clear that you have markedly improved the list handling process. I'm running the 3.5.4 client.

A message of appreciation. Thank you for all y'all are doing on Jamulus. You are clearly stepping up to the increased usage. Software development is intense work. Could y'all use a contribution to support your efforts?

ps: Individual Pan is a neat feature, it helps intelligibility and reduces fiddling with the gain sliders.
Any chance of adding a Country/Bluegrass/Folk Genre? The Classical folks would probably appreciate getting us out of their hair.

@corrados
Contributor

corrados commented Jun 5, 2020

We still seem to have issues, see this post: https://www.facebook.com/groups/507047599870191/permalink/549049065670044
Maybe I will even reduce the list size to 100 (which was actually the original value before Corona).

Interesting that the Standard list (which is usually full) can be seen but the All Genres one cannot. Very strange... Maybe this is not even a UDP fragmentation issue then.

@softins
Member Author

softins commented Jun 5, 2020

Maybe I will even reduce the list size to 100 (which was actually the original value before Corona).

If it is fragmentation, then 100 will not be low enough. If the path from server to client supports a full 1500-byte MTU (not all paths do), that allows a UDP data length of 1458 bytes before fragmentation is needed. If we estimate the average server name as being 16 bytes and the average city name as being 12 bytes, a CLM_SERVER_LIST message for N servers will take 4+N*(6+2+16+2+0+2+12) bytes.

If we say 4+N*40=1458 we get N=36 as the most servers in the list before fragmentation at maximum MTU. That is probably not very useful!
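Written out as a throwaway calculation (the 1458-byte limit and the per-entry size are the estimates above, based on assumed average name and city lengths, not exact protocol figures):

```python
# Rough upper bound on servers per unfragmented CLM_SERVER_LIST message.
USABLE_UDP_PAYLOAD = 1458                  # assumed usable UDP data at a 1500-byte MTU
PER_SERVER = 6 + 2 + 16 + 2 + 0 + 2 + 12   # estimated bytes per list entry (as above)
MESSAGE_OVERHEAD = 4                       # fixed bytes at the start of the message

max_servers = (USABLE_UDP_PAYLOAD - MESSAGE_OVERHEAD) // PER_SERVER
print(max_servers)  # -> 36
```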

Interesting that the Standard list (which is usually full) can be seen but the All Genres one cannot. Very strange... Maybe this is not even a UDP fragmentation issue then.

Yes, I agree that is strange! I would need to see packet traces on the relevant client system to understand why.

@streaps

streaps commented Jun 5, 2020

Your protocol design is broken. The list size workaround is just that: a workaround. Just fix the protocol instead of chopping up server lists.

@softins
Member Author

softins commented Jun 5, 2020

Your protocol design is broken. The list size workaround is just that: a workaround. Just fix the protocol instead of chopping up server lists.

The chopping up is not done in the application. The IP stack will fragment datagrams that don't fit the MTU, and they will be reassembled at the receiving end, transparent to the application. One could more forcefully argue that routers are broken if they refuse to pass fragmented IP datagrams. However, it is easier to fix the application than fix people's routers!

@streaps

streaps commented Jun 25, 2020

Sending non-realtime data over unreliable UDP and then expecting datagram fragmentation to work reliably with NAT is a flawed design. Chopping up the server space based on that flawed design is a questionable workaround.

https://blog.cloudflare.com/ip-fragmentation-is-broken/

@corrados
Contributor

The only way really to solve it would be for the server also to listen on a TCP port

I was looking at this recently. Yes, I agree that a TCP connection would solve a lot of issues. But integrating TCP into the existing system, with all the special cases and backward compatibility, would be a big and also risky (from the stability viewpoint) task. And you have to do it correctly, otherwise you will get all sorts of bad side effects which make the Jamulus experience worse.

So I am now looking at alternatives. Maybe adjust the current UDP protocol so that it splits large network messages into packets of less than about 700 bytes. Let's see if that is possible...
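For illustration, application-level splitting could look roughly like this (the 8-byte part header and its layout are invented for the example; this is not the actual Jamulus split-message format):

```python
import struct

MAX_PART_PAYLOAD = 700 - 8  # keep every datagram well below typical path MTUs

def split_message(msg_id: int, payload: bytes):
    """Yield numbered parts of one large protocol message."""
    parts = [payload[i:i + MAX_PART_PAYLOAD]
             for i in range(0, len(payload), MAX_PART_PAYLOAD)] or [b""]
    total = len(parts)
    for index, part in enumerate(parts):
        # header: message id, part index, total parts, part length (2 bytes each)
        yield struct.pack("!HHHH", msg_id, index, total, len(part)) + part

class Reassembler:
    """Collect parts on the receiving side until a message is complete."""
    def __init__(self):
        self.parts = {}

    def feed(self, datagram: bytes):
        msg_id, index, total, length = struct.unpack("!HHHH", datagram[:8])
        self.parts.setdefault(msg_id, {})[index] = datagram[8:8 + length]
        received = self.parts[msg_id]
        if len(received) == total:
            return b"".join(received[i] for i in range(total))  # complete message
        return None  # still waiting for more parts
```

Unlike IP-level fragments, a lost part here is visible to the application, which can re-request or simply resend it.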

@corrados
Contributor

Here is some good news: #631 (comment). The new "reduced server list" seems to significantly improve the situation.

@corrados
Contributor

corrados commented Oct 7, 2020

@softins It would be interesting to see how the new reduced server list performs when looking at a Wireshark analysis.

I have not heard any positive/negative feedback since I implemented the reduced server list, but the latest Jamulus version has not been out for long. I hope to soon read feedback from users who report that they now see a list where they previously had not.

@lefty665

lefty665 commented Oct 15, 2020 via email

@corrados
Contributor

either not sorting by ping time

If the mouse pointer is located over the table, the list is not sorted. So you have to move the mouse pointer away from the list so that it gets sorted.

@lefty665

lefty665 commented Oct 22, 2020 via email

@lefty665

lefty665 commented Oct 24, 2020 via email

@corrados
Contributor

corrados commented Jan 3, 2021

A long time ago I posted a question which is still open:

@softins It would be interesting to see how the new reduced server list performs when looking at a Wireshark analysis.

I have not heard any positive/negative feedback since I implemented the reduced server list, but the latest Jamulus version has not been out for long. I hope to soon read feedback from users who report that they now see a list where they previously had not.

Is there anybody who can comment on that?

@gene96817

I have just now reviewed this topic and all the discussion. In my opinion:
1- It would be better to stay with UDP than to move to TCP.
2- It appears that some lost UDP packets would explain the problem.
A fix is for the table retrieval to include the table length and row numbers.
Then the client will know if rows are missing because of a lost UDP packet.
The client can then request just the missing rows (see the sketch after this list). Then we don't care if there are links in the path that require short UDP packets (we really should not be sensitive to packet fragmentation).
3- I believe I am seeing bad table updates when there is congestion in the network path causing lost UDP packets. Perhaps changes in how pings are done at startup would help, especially retrying pings if there is no response due to lost UDP packets.
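As a rough sketch of point 2 (the chunk layout is invented for the example, not existing Jamulus code): if each chunk of the table carries its first row number and the client knows the total row count, it can detect gaps and ask again only for the missing rows.

```python
def missing_rows(received_chunks, total_rows):
    """received_chunks: iterable of (first_row, rows) pieces already delivered."""
    have = set()
    for first_row, rows in received_chunks:
        have.update(range(first_row, first_row + len(rows)))
    return sorted(set(range(total_rows)) - have)

# Example: a 150-row server table where the chunk starting at row 50 was lost.
chunks = [(0, ["server"] * 50), (100, ["server"] * 50)]
print(missing_rows(chunks, 150))  # rows 50..99 -> request just these again
```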

@corrados
Contributor

corrados commented Jan 4, 2021

Yes, Jamulus already tries over and over again to retrieve the list if it has not yet been received. The problem for some people was that they never got any list. In your case, you get the list at some point in time.

@gene96817

When retrieving the whole list fails, does Jamulus have a method to retrieve the list in parts? I was fishing for the distinction between (a) the retrieval only succeeds if the whole list arrives, vs. (b) if one of the UDP packets is lost, just the lost packet (or lost part of the table) is retried. When the probability of losing UDP packets is significant, (b) has a better success rate. (b) is also much more tolerant of fragmentation (i.e. having lots of small packets).

@softins
Member Author

softins commented Jan 4, 2021

When retrieving the whole list fails, does Jamulus have a method to retrieve the list in parts? I was fishing for the distinction between (a) the retrieval only succeeds if the whole list arrives, vs. (b) if one of the UDP packets is lost, just the lost packet (or lost part of the table) is retried. When the probability of losing UDP packets is significant, (b) has a better success rate. (b) is also much more tolerant of fragmentation (i.e. having lots of small packets).

Sorry, just got back into this discussion. Just a few comments on the above:

  • The list of servers is only ever sent as a single UDP message, which may be as large as 6000-9000 bytes from a full central server. For such a large message, the IP packet containing it must be fragmented by the networking layer into IP fragments of at most 1500 bytes, and possibly a bit smaller. These are not individual UDP packets; they are component parts of a single UDP packet. If any of them is lost, the whole UDP message is undeliverable. This is why an affected client cannot display any of the server list. It cannot display just parts; it is all or nothing.
  • When the probability of losing UDP packets is significant, the display of servers is the least of your worries. Under such network conditions, the audio quality will likely be too bad for playing anyway.
  • The problem with IP fragmentation is not one of network quality, but rather of router configuration. A small minority of Jamulus users have routers which refuse to pass fragmented IP packets, or have an option to control this, which the user has not set correctly (probably due to not knowing it was there).

@corrados I haven't done any detailed Wireshark analysis of the reduced server list, because I have no problem with receiving and processing fragmented IP packets. I think I did look at it once when first implemented, and noted that it resulted in fewer fragments, but still did require fragmentation.

@gene96817

@softins Thanks... I forgot some of the details about the router configuration.

@corrados
Contributor

corrados commented Jan 5, 2021

I think I did look at it once when first implemented, and noted that it resulted in fewer fragments, but still did require fragmentation.

Thanks for the feedback. Yes, we still get fragmented packets, but the hope is that we get fewer fragments and therefore fewer problems. Looking at the Jamulus discussion forums and Facebook groups, it seems that complaints about empty lists have not shown up recently (at least from what I have read), which is a good sign.

@lefty665

lefty665 commented Jan 7, 2021 via email

@gilgongo
Member

Hi All - I'm moving this to a discussion now until such time as we can firm up some actionable tickets on it.

@jamulussoftware locked and limited conversation to collaborators Feb 19, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
