
Server performance & optimization #455

Closed
WolfganP opened this issue Jul 11, 2020 · 288 comments

Comments

@WolfganP

Follows from #339 (comment) for better focus of the discussion.

So, as the previous issue started to explore multi-threading on the server for better use of resources, I first ran a profiling session of the app on Debian.

Special build with:
qmake "CONFIG+=nosound headless noupcasename debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro && make clean && make -j

Then ran it as below, connecting a couple of clients for a few seconds:
./jamulus --nogui --server --fastupdate

After disconnecting the clients I gracefully killed the server:
pkill -sigterm jamulus

And finally ran gprof, with the results posted below:
gprof ./jamulus > gprof.txt
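(gprof reads the gmon.out file that the instrumented binary writes in its working directory when it exits; a graceful shutdown, rather than a SIGKILL, gives it the chance to be written.)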

https://gist.github.com/WolfganP/46094fd993906321f1336494f8a5faed

It would be interesting if those who observed high CPU usage also ran test sessions and collected profiling information, to detect bottlenecks and potential code optimizations before embarking on multi-threading analysis that may require major rewrites.

@corrados added the improvement label and removed the feature request label on Jul 12, 2020
@dingodoppelt
Contributor

dingodoppelt commented Jul 15, 2020

Hi,
I don't know if this is of much help since I don't know what I'm doing here ;) but I generated the data as instructed on my private cloud server.

https://gist.github.com/dingodoppelt/802c40b1cb13c75d96f38b9604fa22df

cheers, nils

@WolfganP
Author

Thanks @dingodoppelt. Could you please describe the test session/environment? (i.e. how many clients were connected, which hardware/operating system you were running the server on, whatever else you feel is noteworthy)

@WolfganP
Author

@sthenos you mentioned in https://www.facebook.com/groups/507047599870191/?post_id=564455474129403&comment_id=564816464093304 that you're running the server on Linux now.
Would it be possible for you to run a profiling session during any of the casual/preparation jam sessions with multiple clients, to measure REAL server stress? (obviously not during the WJN main event :-)

@dingodoppelt
Contributor

I tested with 12 clients connected from my machine, with small network buffers enabled at a buffer size of 64 samples.
The server is a cloud-hosted KVM virtual root server (1fire.hosting) with 2 vCPUs and 1 GB of RAM on Ubuntu 20.04 lowlatency.
If I get the opportunity I'll repeat the test on my other server in a real-life scenario.
cheers, nils

@storeilly

Are you still interested in this data? I can run a few tests on Ubuntu over the weekend.
Is there a particular release or tag we should check out? I tried to compile last week but it froze my server.

@pljones
Collaborator

pljones commented Jul 18, 2020

One quick comment @WolfganP -- for some reason your build command line (rather than a simple qmake) causes the TARGET = jamulus line to trigger... I don't understand why!

EDIT Dawn strikes... Yes, it does have noupcasename in the CONFIG+=... I just couldn't see it...

So if you're on anything but Windows, you'll probably want to mv the binary back again, otherwise start-up scripts etc. won't work.

Final edit to note: jamulus.drealm.info is running with profiling. I'll leave it up over the weekend so it should amass a fair amount of data. I'll run gprof on Monday. Obviously a bit more "real world", as I run with logging and recording enabled, so I'm expecting different numbers...

@pljones
Collaborator

pljones commented Jul 19, 2020

A different view should come from the Rock and Classical/Folk/Choir genre servers that I've just updated to r3_5_9 with profiling.

make distclean
qmake "CONFIG+=nosound headless debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro
make -j
make clean

They probably won't show much OPUS usage, but this should show anything that's "weird" with the server list (central server) behaviour (although they only have about 20 registering servers, until Default).

I wasn't sure what CONFIG+=debug and -config debug added -- the code appeared to have symbols regardless.

@WolfganP
Author

@pljones yes, I added debug flags to qmake just to make sure all symbols are included and no stripping is applied.
Anyway, a good way to check whether symbols were included in the final executable and the gprof instrumentation was applied is objdump --syms jamulus | grep -i mcount (the mcount* calls being the code added for profiling instrumentation)

@pljones
Collaborator

pljones commented Jul 19, 2020

Standard build:

peter@fs-peter:~$ objdump --syms git/Jamulus-wip/Jamulus | grep -i mcount
0000000000000000       F *UND*  0000000000000000              mcount@@GLIBC_2.2.5

This one just changes the binary name to "Jamulus", IIRC:

peter@fs-peter:~$ objdump --syms git/Jamulus/Jamulus | grep -i mcount
0000000000000000       F *UND*  0000000000000000              mcount@@GLIBC_2.2.5

This one was built with:
make distclean; qmake "CONFIG+=nosound headless debug" "QMAKE_CXXFLAGS+=-pg" "QMAKE_LFLAGS+=-pg" -config debug Jamulus.pro; make -j; make clean

peter@fs-peter:~$ objdump --syms git/Jamulus-stable/Jamulus | grep -i mcount
0000000000000000       F *UND*  0000000000000000              mcount@@GLIBC_2.2.5

Had a few people tonight noticing additional jitter. Not everyone... Those who noticed - myself included - had just upgraded to 3.5.9. No idea why... (I "fixed" it for the evening by upping my buffer size from 64 to 128.)


14 clients connected to the server and it's looking like this in top:

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
32704 Jamulus  -21   0  320.3m  28.6m   9.1m S 10.8  0.2  80:09.58 Jamulus-server
 1066 root     -51   0    0.0m   0.0m   0.0m S  0.2  0.0  10:15.69 irq/130-enp2s0-
 1070 root     -51   0    0.0m   0.0m   0.0m S  0.1  0.0   3:27.74 irq/132-enp2s0-
 1071 root     -51   0    0.0m   0.0m   0.0m S  0.1  0.0   4:00.96 irq/133-enp2s0-

Mmm, I guess those enp2s0 IRQ handlers are a bit busy as that's the network interface... There are actually five, it seems:

peter@fs-peter:~$ ps axlww -L | grep enp2s0
1     0  1063     2  1063 -51   -      0     0 -      S    ?          0:00 [irq/129-enp2s0]
1     0  1066     2  1066 -51   -      0     0 -      S    ?         10:06 [irq/130-enp2s0-]
1     0  1068     2  1068 -51   -      0     0 -      S    ?          0:01 [irq/131-enp2s0-]
1     0  1070     2  1070 -51   -      0     0 -      S    ?          3:22 [irq/132-enp2s0-]
1     0  1071     2  1071 -51   -      0     0 -      S    ?          3:56 [irq/133-enp2s0-]

but it copes with only three under heavy demand; 129 and 131 seem left out.

Let's see what the gprof looks like in the morning :).

@pljones
Collaborator

pljones commented Jul 20, 2020

OK, I decided to restart the central servers without profiling before I totally forget, so all the numbers are now in.
Jamulus-Central1 gprof.out
Jamulus-Central2 gprof.out
jamulus.drealm.info gprof.out

@WolfganP
Author

WolfganP commented Jul 20, 2020

OK, I decided to restart the central servers without profiling before I totally forget, so all the numbers are now in.
Jamulus-Central1 gprof.out
Jamulus-Central2 gprof.out
jamulus.drealm.info gprof.out

Thx for the info @pljones, good to also have some performance info for the Central Server role.

Regarding the info on the audio server role, it seems to confirm the CPU impact of CServer::ProcessData and some Opus routines (I assume as a result of the mix processing inside CServer::OnTimer), and that makes sense (at least to me).
On the Opus codec front, I found this article worth reading: https://freac.org/developer-blog-mainmenu-9/14-freac/257-introducing-superfast-conversions/ (code at https://github.com/enzo1982/superfast#superfast-codecs)

Another item I think needs some attention (or verification that it's already optimized) is the buffering of audio blocks, to avoid unnecessary memcopies. But I'm still reading the code :-)

@WolfganP
Author

Are you still interested in this data? I can run a few tests on Ubuntu over the weekend.
Is there a particular release or tag we should check out? I tried to compile last week but it froze my server.

Of course @storeilly, the more information the better, to compare performance across different use cases, verify common patterns of CPU usage, and direct optimization efforts.

@storeilly

Here is a short test on GCP n1-standard-2 (2 vCPUs, 7.5 GB memory), Ubuntu 18.04.
A single user connected for a few seconds. I'm running the server overnight with two instances, one on a private central server and another on jamulusclassical.fischvolk.de:22524, for more data.

gprof.txt

@storeilly

jamprof01.txt
jamprof02.txt

Overnights with 1 or 2 connections... Choir meeting later so will run again after that

@WolfganP
Author

Thanks @storeilly for the files, but those last two cover an extremely short period of app usage; they don't even register significant stats to evaluate (even the cumulative times are 0.00).

@storeilly

jamprof03.txt

Oh sorry about that, maybe because I had them running as a service. I saw the message just before the choir meeting so ran this up as a live instance. We only had 8 connections for about 90 mins so I hope it is of some use.

@WolfganP
Author

Thanks @storeilly, that latest file is more representative of a live session and similar to the others posted previously. Thx a lot for sharing.

@corrados
Contributor

For your info: I will change the ProcessData function now to avoid some of the Double2Short calls and have a better clipping behavior.

@WolfganP
Author

For your info: I will change the ProcessData function now to avoid some of the Double2Short calls and have a better clipping behavior.

Excellent @corrados, we can keep running profiling sessions here and there and measure the improvements.
CMovingAv (in src/util.h) is another function that is called very frequently according to the stats; do you mind checking whether any unnecessary type conversion happens in there as well?

Another thing I still haven't been able to pay sufficient attention to is the management of the audio block buffers, to make sure unnecessary memcopies are avoided. Do you recall how it is implemented?
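For illustration of the Double2Short point, here's a minimal sketch (made-up names, not the actual Jamulus code) of the general idea: accumulate the mix in a wider integer type and clip/convert once per output sample instead of converting every intermediate value:

// Sketch only: mix several int16 channels into one block using a wide
// accumulator, with a single clip/convert per output sample.
#include <cstdint>
#include <vector>

std::vector<int16_t> MixChannels ( const std::vector<std::vector<int16_t>>& vecChannels,
                                   const size_t                             iBlockSize )
{
    std::vector<int16_t> vecOut ( iBlockSize, 0 );

    for ( size_t i = 0; i < iBlockSize; i++ )
    {
        int32_t iMix = 0; // wide accumulator, no per-addition conversion

        for ( const auto& vecChan : vecChannels )
        {
            iMix += vecChan[i];
        }

        // clip once, then convert to the 16 bit output format
        if ( iMix > 32767 )  { iMix = 32767; }
        if ( iMix < -32768 ) { iMix = -32768; }

        vecOut[i] = static_cast<int16_t> ( iMix );
    }

    return vecOut;
}

Whether that matches what was actually changed in ProcessData, the next profiling runs will tell.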

@corrados
Contributor

I will do further investigations when I return from my vacation.

@storeilly

Is there a possibility somebody could build a Windows exe with the profiling config? A friend is trying a multi-server load test on Windows tomorrow evening.

@pljones
Collaborator

pljones commented Jul 28, 2020

Is there a possibility somebody could build a Windows exe with the profiling config? A friend is trying a multi-server load test on Windows tomorrow evening.

https://docs.microsoft.com/en-us/visualstudio/profiling/running-profiling-tools-with-or-without-the-debugger?view=vs-2017
Virtually all the docs refer to working in Visual Studio using graphical tools rather than simple tools like gprof. This was about the closest I could find...

The Windows build doesn't seem to like "CONFIG+=nosound headless"

...src\../windows/sound.h(28): fatal error C1083: Cannot open include file: 'QMessageBox': No such file or directory

Leaving headless out lets the build run. (Though nosound should prevent ...src\../windows/sound.h being used, surely?)


So, Qt Creator under Windows has an "Analyze -> Performance Analyzer" tool. First thing, it kicks off the compile... Before it tries to run, it says:
[screenshot: warning dialog]
Hm. Surely it can check what OS it's running on (i.e. it's Windows!) and disable the menu item entirely? No... So hit OK. It then fails:
[screenshot: error message]
Yes, quite. And then, just to make sure you know it didn't work:
[screenshot: another error notification]

@dingodoppelt
Contributor

https://gist.github.com/dingodoppelt/9fecd468be2176dacd6d6d3ae3d1d078

Here is another one:
a public server (that hasn't seen too much use) on a 4-core CPU. It ran on the most recent code, including the reduction of Double2Short calls, if that is of any interest here.

@corrados
Contributor

Thanks dingodoppelt. In your profile log the ProcessData() function is much lower in the list compared to the profile given by storeilly in jamprof03.txt. So the Double2Short optimization may already have given us a faster Jamulus server.

@pljones
Collaborator

pljones commented Aug 1, 2020

I've been dabbling with getting my service units to run at real-time priority. The 2013 documentation linked from one of the guides is actually out of date. There's no need to fuss with changing cgroups from within the service unit - the latest kernels are quite happy dealing with individual slices.

[Service]
CPUSchedulingPolicy=rr
CPUSchedulingPriority=99
IOSchedulingClass=realtime
IOSchedulingPriority=3
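To apply settings like these (assuming the unit is called jamulus.service), a drop-in created with systemctl edit jamulus.service does the job, followed by the usual:

sudo systemctl daemon-reload
sudo systemctl restart jamulus.service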

Having said that, I noticed something when checking the status with ps:

peter@fs-peter:~/git/Jamulus-wip$ sudo chrt -r 99 sudo -u peter ./Jamulus -s -n -p 55850 -R /tmp/recording; rm -rf /tmp/recording/
- server mode chosen
- no GUI mode chosen
- selected port number: 55850
- recording directory name: /tmp/recording
Recording state enabled
 *** Jamulus, Version 3.5.10git
 *** Internet Jam Session Software
 *** Released under the GNU General Public License (GPL)

So that starts up the server real time quite happily:

peter@fs-peter:~/git/Jamulus-wip$ ps axwwH -eo user,pid,tid,spid,class,pri,comm,args | sort +5n | grep 'PID\|Jamulus' | grep 'PID\|^peter'
USER       PID   TID  SPID CLS PRI COMMAND         COMMAND
peter    11320 11322 11322 TS   19 Jamulus::CSocke ./Jamulus -s -n -p 55850 -R /tmp/recording
peter    11320 11320 11320 RR  139 Jamulus         ./Jamulus -s -n -p 55850 -R /tmp/recording
peter    11320 11321 11321 RR  139 Jamulus::JamRec ./Jamulus -s -n -p 55850 -R /tmp/recording

What I couldn't follow in the flow of control was why CSocket loses real-time priority and yet JamRecorder retains it.

I'm looking to drop the priority of the jam recorder, really - and I'd have thought the socket handling code wanted to retain it?

Here's the patch that names the CSocket thread:

peter@fs-peter:~/git/Jamulus-wip$ git diff
diff --git a/src/server.cpp b/src/server.cpp
index fe9b50a8..ed1bae35 100755
--- a/src/server.cpp
+++ b/src/server.cpp
@@ -58,6 +58,8 @@ CHighPrecisionTimer::CHighPrecisionTimer ( const bool bNewUseDoubleSystemFrameSi
     veciTimeOutIntervals[1] = 1;
     veciTimeOutIntervals[2] = 0;

+    setObjectName ( "Jamulus::CHighPrecisionTimer" );
+
     // connect timer timeout signal
     QObject::connect ( &Timer, &QTimer::timeout,
         this, &CHighPrecisionTimer::OnTimer );
diff --git a/src/socket.h b/src/socket.h
index adc9c67f..b4ec15e9 100755
--- a/src/socket.h
+++ b/src/socket.h
@@ -169,7 +169,9 @@ protected:
     {
     public:
         CSocketThread ( CSocket* pNewSocket = nullptr, QObject* parent = nullptr ) :
-          QThread ( parent ), pSocket ( pNewSocket ), bRun ( true ) {}
+          QThread ( parent ), pSocket ( pNewSocket ), bRun ( true ) {
+        setObjectName ( "Jamulus::CSocketThread" );
+    }

         void Stop()
         {

(I don't know what CHighPrecisionTimer is used for but I didn't see it get a thread - maybe it's for short-lived stuff?)
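As an aside, a thread can also request real-time scheduling for itself explicitly instead of relying on inheriting it from its parent. A minimal sketch using the plain pthread API (not Jamulus code; it needs CAP_SYS_NICE or a suitable rtprio limit):

// Sketch: ask for SCHED_RR at the given priority for the calling thread (Linux).
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static void RequestRealtime ( const int iPriority )
{
    sched_param schedParam {};
    schedParam.sched_priority = iPriority;

    if ( pthread_setschedparam ( pthread_self(), SCHED_RR, &schedParam ) != 0 )
    {
        std::perror ( "pthread_setschedparam" );
    }
}

Called at the top of a thread's run function, something like this would keep that thread real-time regardless of what the parent thread was doing.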

@WolfganP
Author

WolfganP commented Aug 1, 2020

The way I understood the server code, CHighPrecisionTimer is the base for the "realtime" processing of the client audio mixes in ProcessData (one of the functions consistently topping the performance charts), via the OnTimer callback firing every 1 ms (https://github.com/corrados/jamulus/blob/f67dbd1290a579466ff1f315457ad9090b39747e/src/server.cpp#L792)

That asynchronous processing of data via the timer was probably why the early parallelization test didn't work as intended.
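For reference, the basic pattern being described -- a periodic timer whose timeout handler drives the per-block mixing -- looks roughly like this in Qt (an illustrative skeleton, not the Jamulus code):

#include <QCoreApplication>
#include <QTimer>

int main ( int argc, char* argv[] )
{
    QCoreApplication app ( argc, argv );

    QTimer Timer;
    Timer.setTimerType ( Qt::PreciseTimer );

    QObject::connect ( &Timer, &QTimer::timeout, []()
    {
        // here the server would mix, encode and send one audio block
        // per connected client (the ProcessData work)
    } );

    Timer.start ( 1 ); // fire roughly every 1 ms

    return app.exec();
}

Since everything hangs off that one timeout handler, parallelizing means splitting the work inside the handler across threads, which is roughly what the later multithreading discussion is about.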

@corrados
Contributor

That's good news, thanks :-).

@plungerman

we have a choral group with ~100 members who would like to use one of our jamulus instances with --numchannels=120 . at the moment, our jamulus 3.5.9 only supports 50. which version can i download or compile to take advantage of numchannels greater than 50? thanks in advance.

@softins
Member

softins commented Nov 11, 2020

we have a choral group with ~100 members who would like to use one of our jamulus instances with --numchannels=120 . at the moment, our jamulus 3.5.9 only supports 50. which version can i download or compile to take advantage of numchannels greater than 50? thanks in advance.

You can use the latest version, 3.6.0, which allows --numchannels of up to 150. To support that number you will also want to run the server on a dedicated machine with good bandwidth and several CPU cores, and to enable --multithreading.
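For example (illustrative only - check ./Jamulus --help on your build for the exact option names), a headless start for ~120 channels might look like:

./Jamulus --server --nogui --numchannels 120 --multithreading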

@kraney

kraney commented Nov 11, 2020

we have a choral group with ~100 members who would like to use one of our jamulus instances with --numchannels=120 . at the moment, our jamulus 3.5.9 only supports 50. which version can i download or compile to take advantage of numchannels greater than 50? thanks in advance.

You might also try my fork at https://github.com/kraney/jamulus, where I've been able to boost performance further on lesser hardware. There's been some wish for independent confirmation of the fork's performance.

@plungerman

Excellent, thanks. I will try 3.6.0 first and report back presently.

@ann0see
Member

ann0see commented Nov 20, 2020

I made a test with ~20 to 30 jamulus/jack/fake-sound-card/headless clients with the latest 3.6.0 fork (and some mini changes from my fork) on a dual-core virtual machine (Intel® Xeon® E5-2660V3) and the latency was so-so. htop reported around 70% usage per core. The server was started with -F and -T.

Before that I tried to connect my interface to JACK on a laptop running a 6th-gen Intel Core m5. The client went almost unresponsive and one core on the server was at 99%. Even chat messages from another device were delayed.
Sound quality was really bad.

Does that mean that slow clients can delay the server?

@dingodoppelt
Contributor

Hi,
I tested my 4-vCore server with two Jamulus servers running simultaneously in multithreaded mode. With small network buffers enabled the sound went bubbly around 20 users, as expected (CPU load at around 130/400%). I connected 35 clients, which killed sound quality altogether. BUT: on the other Jamulus instance running on the same server, audio was perfect (around 10 more clients).
I could effectively fit almost 50 clients on the hardware (which is a virtual machine) but not in one server. I haven't checked yet whether I could fit even more when launching even more server instances.
Is this to be expected, or might it be a hint at some bottleneck still keeping the server from making full use of the processing power before the audio quality starts to degrade?
cheers

@corrados
Contributor

corrados commented Dec 2, 2020

Is this to be expected

I would say, no. But it's hard to track down this bottleneck in your scenario. What is interesting is that on my 4-CPU-core machine I could serve 100 clients, see #455 (comment).

@dingodoppelt
Contributor

dingodoppelt commented Dec 4, 2020

What is interesting is that on my 4-CPU-core machine I could serve 100 clients

When I disable small network buffers I can fit way more clients. I did my tests with small network buffers enabled (buffer size 64) and this really kills most of the servers (which are virtual machines in the cloud) at around 20-30 clients. The interesting part was that I could overload just one server (by connecting 35 clients; it showed pings and delays of >500 ms) but not the hardware the servers were running on (the other server still fit 10 clients before it broke). I wonder if it is possible to give that spare processing power to just one server, to fit as many clients as the hardware can handle.

@corrados
Contributor

corrados commented Dec 4, 2020

but not the hardware the servers were running on

This is an interesting point. As far as I understood, you are running a virtual server. So you cannot say that you reached the limit of the hardware, only the limit of your virtual server. Maybe the virtual server limits CPU access on a per-thread basis. That would explain the behaviour you describe.

I think if we really want to tweak the multithreading performance of the Jamulus server even further, we have to do it on real hardware and not on a virtual server, because the virtual server has too many unknowns when it comes to resource sharing.

@kraney

kraney commented Dec 4, 2020

Two points here - it's definitely true both AWS and GCP offer instance types where you get fractional CPU cores; your instance gets "credits" that accumulate over time, and when you are actually using the CPU you spend those credits. That lets you "burst" and use the whole core for a little while, but with sustained use you'll run out of credits and get swapped out for another VM. These instances aren't a good fit for Jamulus if you're trying to maximize the number of clients you can support. There's an alternative to switching to bare hardware, which is to use an instance type that doesn't limit you to fractional cores.

Second, getting 10 more clients on a second server instance doesn't imply that there's a way to get the first instance to grow by 10 more clients instead, because the amount of work doesn't grow linearly with the number of clients. It grows as n². So going from 35 clients to 45 clients on a single server adds 800 new mix operations, while starting a second server with 10 clients instead only adds 100 mix operations. Having space for 100 more mix operations only gets you from 35 clients to 36, if it's all in the same server process.
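(Working through the arithmetic with n² mixes per frame: 45² − 35² = 2025 − 1225 = 800 extra mixes on the existing server, versus 10² = 100 for a fresh 10-client server; and 36² − 35² = 71, which is why roughly 100 spare mixes buys only one more client on the big server.)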

@gene96817

I am wondering if we are taking the wrong perspective on performance with cloud services. Each cloud service has a different approach to maximizing utilization of its computing and networking resources. Jamulus is unique because we care about real-time performance. Most (the ideal) cloud apps care more about lots of computing in bursts and less about real-time performance (or "real-time" to these apps means hundreds of milliseconds). Task switching means buffering, and we know buffering means latency. As we measure the load for additional clients, we should be looking at how buffering and latency change.

@maallyn

maallyn commented Dec 4, 2020

Folks:
I have a Linode dedicated (2 CPU) machine (newark-music.allyn.com) which is at the latest stable (3.6.1) compiled with config = nosound. I did not notice any configs for multi-thread or buffer size when I looked at the Jamulus.pro file.

I hope this can help. I am willing to spend the extra for a dedicated 4-CPU machine for a short while if you think that will help.

@dingodoppelt
Contributor

Folks:
I have a Linode dedicated (2 CPU) machine (newark-music.allyn.com) which is at the latest stable (3.6.1) compiled with config = nosound. I did not notice any configs for multi-thread or buffer size when I looked at the Jamulus.pro file.

Hi there,
I couldn't fit more than 20 clients on your machine. Are you running it with the -F and -T parameters?

I have been playing around with sysbench, a tool for performance measurement, and I found that cloud server performance is pretty good CPU-wise but awful for memory performance, where my dedicated machine really shines. I ran this test on my home machine:

sysbench --threads=`nproc` memory run 
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 8
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 104857600 (19856820.40 per second)

102400.00 MiB transferred (19391.43 MiB/sec)


General statistics:
    total time:                          5.2791s
    total number of events:              104857600

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    1.84
         95th percentile:                        0.00
         sum:                                28463.16

Threads fairness:
    events (avg/stddev):           13107200.0000/0.00
    execution time (avg/stddev):   3.5579/0.08

and on my cloud server:

sysbench --threads=`nproc` memory run
sysbench 1.0.18 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 4
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 21430802 (2142010.21 per second)

20928.52 MiB transferred (2091.81 MiB/sec)


General statistics:
    total time:                          10.0010s
    total number of events:              21430802

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                   10.76
         95th percentile:                        0.00
         sum:                                34558.68

Threads fairness:
    events (avg/stddev):           5357700.5000/266087.46
    execution time (avg/stddev):   8.6397/0.05

This doesn't look too good in comparison. Maybe this is the bottleneck? Do you think sysbench could be a reliable tool to measure server performance instead of trial and error, or are there any other tools I could try?

@pljones
Collaborator

pljones commented Dec 6, 2020

How much memory is used by a Jamulus server thread?

@brynalf

brynalf commented Dec 6, 2020

On my Windows system at home a server uses only about 60 MB of memory.
It scales slightly with the number of attached clients. Example: 0 clients ==> 56 MB, 50 clients ==> 59 MB.
Edit: You probably were asking about a specific case above, so never mind my answer :)

@pljones
Collaborator

pljones commented Dec 6, 2020

Edit: You probably were asking about a specific case above, so never mind my answer :)

I meant Jamulus memory usage, as you've given.

The test was about memory throughput, if I read it correctly. If Jamulus isn't memory constrained, then the test shown won't be representative of Jamulus performance.

@dingodoppelt
Contributor

The test was about memory throughput, if I read it correctly. If Jamulus isn't memory constrained, then the test shown won't be representative of Jamulus performance.

I just wondered if that might be the issue with the cloud servers. The CPU performance is fine and doesn't really deviate from what I measure on real hardware. The only thing I could find using sysbench was the restricted memory throughput in comparison to real hardware, so I figured this might be another thing to look at, since cloud servers die long before the CPU is used up.

@pljones
Collaborator

pljones commented Dec 7, 2020

The CPU performance is fine and doesn't really deviate from what I measure on real hardware.

On average, that may be true. Are you getting a reading for consistency of performance - i.e. how much the CPU performance deviates between maximum throughput and minimum? As noted above, it's that stability that Jamulus needs and which directly affects its capacity.

@corrados
Contributor

corrados commented Dec 7, 2020

@dingodoppelt I don't know if you are on facebook, but there is a report about successfully having 53 clients connected to a Jamulus server on a 4-CPU virtual server: https://www.facebook.com/groups/619274602254947/permalink/811257479723324: "Had 53 members of a youth orchestra this evening on Jamulus (and another 15-20 listening on Zoom). Took about 90 minutes of setup so we only got through a reading of Jingle Bells at the end but it was a great first step! AWS 4 vCPU server hit ~55%."

@dingodoppelt
Contributor

@corrados: my server does this too, but not with every client on small network buffers. I've played on servers with around 50 people, but you can never tell if everybody has small network buffers enabled. In my tests I connected every client with the same buffer size and small network buffers enabled. It only worked for me on dedicated hardware (namely the WorldJam and Jazzlounge servers).

@sbaier1

sbaier1 commented Dec 12, 2020

Haven't done any testing or thorough research here yet, but just a heads-up: there are also several kernel parameters for the UDP networking stack, and general network parameters, that could be tuned with sysctl and might have a positive effect.
(Is Jamulus ever network-bound?)

https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt
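For example (values purely illustrative and not tested with Jamulus), the UDP socket buffer limits and the device backlog are the usual knobs:

sysctl -w net.core.rmem_max=2500000
sysctl -w net.core.wmem_max=2500000
sysctl -w net.core.netdev_max_backlog=5000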

@gene96817

@sbaier1 I am out on the Internet frontier (i.e. far from the short distances within Europe) and see network behaviour as a dominant contributor to latency-dependent performance. I'd be interested in a discussion about what can be done to quench traffic and discard packets. These mechanisms might be a good way to improve the performance (at the expense of audio interruptions, which would be happening anyway). Especially with regard to a different thread on buffer backups (they called it bufferbloat), the only way to manage problems in the network with packets backing up at some routers would be code that detects the backups and quenches traffic. I have some musicians who will "tolerate" 20-70 ms latency rather than not have music. Actively managing the packet rate at 40+ ms would greatly improve the experience. (Note: I suspect some of the buffer backup comes from the interaction between our UDP traffic and other people's TCP cross-traffic.)

@ann0see
Member

ann0see commented Feb 6, 2021

There's a new PR for multithreading: #960

@menzels
Contributor

menzels commented Feb 6, 2021

Thanks @ann0see for pointing me to this thread. I could not read all the comments here, but I want to add my findings to the thread.

I noticed difficulties fitting more than about 17 clients on my Hetzner cloud vServer, even though I tried configurations from 2 to 16 cores.
So I took a closer look at the multithreading code on the Jamulus server side and found that it did not distribute CPU load to more cores soon enough. Specifically I am talking about the MixEncodeTransmitData function, which uses about 60-70% of all CPU time in Jamulus.
I changed this code to distribute the load evenly among CPU cores and found that it fixed my problem (see the sketch of the idea below).

My guess is that the CPU cores on my cloud server are not as strong as the ones the current multithreading code was developed for; I see a comment mentioning Amazon cloud servers there.
I think my change is a big improvement here, because it just distributes the load evenly between cores and does not need to know about the power of the cores.

I only tested this change with up to 21 clients, and I see much better CPU usage across cores.
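To make the idea concrete, here's a generic sketch of round-robin distribution of per-client mix/encode jobs over a fixed pool of workers (made-up names, not the actual change in my branch):

#include <functional>
#include <thread>
#include <vector>

// Sketch only: run vecJobs (one job per connected client) spread evenly over
// iNumWorkers threads, independent of how fast the individual cores are.
void RunMixJobs ( const std::vector<std::function<void()>>& vecJobs,
                  const unsigned int                        iNumWorkers )
{
    std::vector<std::thread> vecWorkers;

    for ( unsigned int iWorker = 0; iWorker < iNumWorkers; iWorker++ )
    {
        vecWorkers.emplace_back ( [&vecJobs, iWorker, iNumWorkers]()
        {
            // worker iWorker handles jobs iWorker, iWorker + iNumWorkers, ...
            for ( size_t iJob = iWorker; iJob < vecJobs.size(); iJob += iNumWorkers )
            {
                vecJobs[iJob]();
            }
        } );
    }

    for ( std::thread& thWorker : vecWorkers )
    {
        thWorker.join();
    }
}

A real implementation would keep a persistent thread pool instead of spawning threads every audio frame, but the distribution logic is the relevant part here.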

@gilgongo
Member

Hi all - in an effort to rationalise the Issues list into something just for actionable work items (we hope to apply milestones and things at some point), I'm moving this to a discussion if that's OK.

@jamulussoftware locked and limited conversation to collaborators on Feb 19, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
