
NodeJS configuration optimization #11183

Closed
basisbit opened this issue Jan 20, 2021 · 33 comments

Comments

@basisbit
Collaborator

basisbit commented Jan 20, 2021

Describe the issue
We use 12-core AMD Ryzen CPUs (Ryzen 9 3900 @ ~4.1 GHz) in a bunch of our BBB servers. Until recently, the bottleneck was mostly NodeJS hitting 120% CPU usage and staying there for a few minutes until user load dropped substantially again. This usually happened shortly after NodeJS reached 60% average CPU usage in htop. After some code reading and debugging, we managed to improve the performance of NodeJS by approximately 40%, so that our servers can now comfortably handle up to 600 concurrent users per server (mostly students, 50% with webcams enabled, typing indicator enabled but limited to one indicator update per second) instead of the previous 425 users on the same hardware with the same settings.

Now the question is: how / what files should I change in the repository in a PR so that this will be part of the next BBB 2.2.x release?

BBB version (optional):
BigBlueButton Server 2.2.31
Recording disabled, very little logging, typing indicator updates limited to once per second, whiteboard mouse hover position updates limited to once every 150 ms, 3 Kurento processes.

Changes applied:

  1. Open /usr/share/meteor/bundle/systemd_start.sh
  2. Replace development with production
  3. Replace the last line with PORT=3000 /usr/share/$NODE_VERSION/bin/node --max-old-space-size=4096 --max_semi_space_size=128 main.js (see the sketch below)
  4. Wait until no users are on the server any more, then run service bbb-html5 restart
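
For reference, a minimal sketch of how the tail of systemd_start.sh might look after steps 2 and 3 (assuming the stock 2.2 layout of the script; unrelated exports omitted):

# change to start meteor in production (https) or development (http) mode
ENVIRONMENT_TYPE=production

NODE_VERSION=node-v8.17.0-linux-x64
cd /usr/share/meteor/bundle
export NODE_ENV=$ENVIRONMENT_TYPE
# raise the V8 old-space limit to 4 GB and the max semi-space size to 128 MB
PORT=3000 /usr/share/$NODE_VERSION/bin/node --max-old-space-size=4096 --max_semi_space_size=128 main.js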

Additional information
Before we applied this change, NodeJS occasionally crashed every few days, showing FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory when trying to parse some JSON objects. Changing max-old-space-size from the default of 1.5 GB to 3 GB fixed those crashes, and raising the limit further to 4 GB helped reduce the total amount of time spent in garbage collection without increasing the interruption time too much (the GC does stop-the-world garbage collection).
The --max_semi_space_size=128 was the key to improving the number of concurrent users.
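
A quick way to verify that node actually picks the flags up (just a sanity check, not part of the BBB setup) is to print the resulting V8 heap limit:

# heap_size_limit should come out at roughly 4 GB plus the semi-space reservation
/usr/share/$NODE_VERSION/bin/node --max-old-space-size=4096 --max_semi_space_size=128 \
  -e 'console.log(require("v8").getHeapStatistics().heap_size_limit / 1024 / 1024, "MB")'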

PS: If you currently use our NodeJS CPU monitor/exporter, you'll have to update the package and then restart its service.

@ffdixon
Member

ffdixon commented Jan 20, 2021

Thanks for sharing this, basisbit. How much memory are your servers running with?

Our documentation suggests a minimum of 16 GB of memory. Wondering if this change to systemd_start.sh affects the memory available to other parts of BigBlueButton (we'll do some testing on our end)?

@basisbit
Collaborator Author

basisbit commented Jan 20, 2021

Our servers all have 64 or 128 GB of RAM, so setting the maximum limit for NodeJS to 4 GB was not a problem in our case. Regarding the current 16 GB minimum in the documentation: BBB works okay even with 8 GB in some setups and with 12 GB in almost every setup I have seen so far. Our servers usually use between 10 and 14 GB of RAM, depending on how long they have been up, on the number of concurrent users, and on whether files are currently being converted (either conversion of recorded sessions, or conversion of pdf/png/ppt files).
The NodeJS garbage collector and its semi-space and heap-space sizing algorithms are very similar to those in Java: it will only allocate the memory if it is needed for decent performance and will reduce the heap size again when it is no longer needed (depending on how often the garbage collector has to run per time frame).
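
If you want to watch this behaviour yourself, V8's --trace-gc flag logs every collection together with the heap size before and after. As a sketch, it can be added temporarily to the node invocation in systemd_start.sh (debugging aid only, not part of the suggested change):

PORT=3000 /usr/share/$NODE_VERSION/bin/node --trace-gc --max-old-space-size=4096 --max_semi_space_size=128 main.js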

As a couple of people mentioned already, hitting the 1.5 GB default limit happens occasionally in production on a typical 2.2.31 server. The typical strategy would be to double that amount, thus setting max-old-space-size=3072 for a default BBB server. The max_semi_space_size should not have any noticeable impact on the memory usage of any BBB server.

@schrd
Collaborator

schrd commented Jan 20, 2021

Our servers have 16 GB RAM, usually 7-9 GB are used. Within the last 90 days the maximum memory utilisation was 9.12 GB. We are running a backport of #10349 to 2.2.30 with 2 frontend workers and 1 backend worker.

I would suggest making this configurable by operators, with a sensible default. Operators could set it via the systemd environment variable NODE_OPTIONS; systemd_start.sh could then set a default value:

NODE_OPTIONS=${NODE_OPTIONS:-"--max-old-space-size=4096 --max_semi_space_size=128"}
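
As a sketch of how an operator could then set this per host without touching the packaged script (the drop-in path follows standard systemd conventions and is an assumption, not something the BBB packages ship):

mkdir -p /etc/systemd/system/bbb-html5.service.d
cat > /etc/systemd/system/bbb-html5.service.d/override.conf <<'EOF'
[Service]
# node reads NODE_OPTIONS directly; max_semi_space_size is omitted here (see the reply below)
Environment=NODE_OPTIONS=--max-old-space-size=4096
EOF
systemctl daemon-reload && systemctl restart bbb-html5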

@basisbit
Collaborator Author

max_semi_space_size is not allowed in the NODE_OPTIONS environment variable. However, NODE_OPTIONS could be used for max-old-space-size.

@antobinary
Member

typing indicator enabled but limited to one indicator update per second

This was just added as a default in 2.2 and will be part of 2.2.32
See #11199

@antobinary
Member

whiteboard mouse hover position updates limited to once every 150ms

This was added to 2.2 with the default values unchanged (for now) but since it's configurable, you can change it without hurdles
#11187

@antobinary
Member

  1. Replace development by production

I thought we changed this in the packages a long time ago. This is what I see in the 2.2 packages:

# change to start meteor in production (https) or development (http) mode
ENVIRONMENT_TYPE=production

(omitting some Mongo code)

NODE_VERSION=node-v8.17.0-linux-x64

cd /usr/share/meteor/bundle
export ROOT_URL=http://127.0.0.1/html5client
export MONGO_OPLOG_URL=mongodb://127.0.1.1/local
export MONGO_URL=mongodb://127.0.1.1/meteor
export NODE_ENV=$ENVIRONMENT_TYPE
export SERVER_WEBSOCKET_COMPRESSION=0
export BIND_IP=127.0.0.1
PORT=3000 /usr/share/$NODE_VERSION/bin/node main.js

@basisbit Perhaps there is 'development' in another spot for me to change?

=======================

... --max-old-space-size=4096 --max_semi_space_size=128 ...

so basically these two flags are what remains to fully address your initial post (please correct me if I am missing something).
For your purposes they are of course editable in the systemd_start.sh script. However, in the interest of others benefiting from this setup, it would be nice to include them by default (unless they cause trouble in some setups).

--max-old-space-size=4096: we have tried in the past to pass this flag with a value of 2048 but could not determine for sure whether it was helpful, so it was not included.

--max_semi_space_size=128: this flag is new to me. After some reading it is still new :)

With both flags I am a bit hesitant to override the defaults in the package yet.
However, I see that someone was faster than me and set --max-old-space-size=4096 --max_semi_space_size=128 on demo1 (one of the demo.bigbluebutton.org servers), so this should allow for some monitoring before deciding which flags to include.

@pbdco

pbdco commented Jan 27, 2021

Thank you @basisbit for sharing this! I hope it will be included in the next .32 release by default!

@basisbit
Collaborator Author

basisbit commented Jan 27, 2021

@antobinary any update on your testing progress? We are still using this change successfully in production on ~300 BBB servers, and so far we only see the ~30% lower average NodeJS CPU utilisation and no disadvantages.

On a side note, we also did some more tests with --max-old-space-size=2048 and still got occasional out-of-memory errors. So I'd suggest setting this to at least 3 GB by default.

@basisbit Perhaps there is 'development' in another spot for me to change?

Various other BBB operators have in the meantime confirmed that some of the BBB servers set up with the official "2.2.31" had it set to development. I am now 99.9% sure that some of the packages were replaced in the package repository a few times since the 2.2.31 release. This kind of violates the tivoization part of the GPL (which also applies to the LGPL), so it would be nice if such changes were not made without git tags in the future.

@fcecagno
Member

@basisbit how do you detect the out-of-memory errors? Is FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory printed to syslog?

@basisbit
Collaborator Author

Yes, in the systemd service log.
For example, it looked like this: https://gist.github.com/basisbit/7dc3929a7e6cfba75d8bfc67e3840684
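
To check for them on a running server (assuming the stock bbb-html5 unit name), something like this works:

journalctl -u bbb-html5 | grep -B 2 "JavaScript heap out of memory"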

@mabras

mabras commented Feb 2, 2021

We tried this on 6 of our BBB production servers. We saw a real difference with these params:

--max-old-space-size=4096 --max_semi_space_size=128

I noticed that memory usage roughly doubled compared to the usual usage.
I would like to know whether increasing the values (4096 and 128) further would be any better? We have extra memory.

Thanks @basisbit

@antobinary
Member

@antobinary any update on your testing progress?

I saw one OOM on the demo pool, but it also coincided with some metrics for heap being enabled. All in all, personally I would be OK with including this in the next release (documented).
Given that we iterate on 2.3 more frequently, I was thinking of setting these defaults there, but then I am wondering how it would affect the multiple nodejs processes. demo3 has 16 GB RAM and currently I have it running 4 nodejs processes (pretty stable). If I were to set a 4 GB upper boundary on each process, am I not putting myself in danger of hitting OOM, given that they share the same resources with the other components? I would then be limiting my setup to 2 parallel nodejs processes, which might reduce the number of users supported... Any thoughts?

@basisbit
Collaborator Author

basisbit commented Feb 4, 2021

It is called max-old-space-size because the NodeJS garbage collector tries to keep the old space small; only when it is spending quite a lot of CPU time on running garbage collection too often does it increase the size of the old space a bit, and so on, up to the configured maximum.

After thinking a bit about the automatic configuration topic, I'd suggest the following as new default:

  • for 2.2.x set max-old-space-size=4096
  • for develop, set max-old-space-size=2048

This should work well as a tradeoff: 2.2.x is much more likely to hit the single-threading performance limit than multi-process 2.3.x, but 2.3.x is slightly more likely to hit a RAM usage boundary when the server is used by many concurrent users. It should be a rather safe assumption that 2.3.x can afford to spend a bit more CPU time on running garbage collection more often than 2.2.x, because the multi-process setup has fewer users per NodeJS process. So this is a CPU time vs. RAM usage tradeoff, and it is only noticeable when the server is under high concurrent user load.
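
For illustration (rough worst-case numbers, assuming a 16 GB host running 4 nodejs processes as on demo3): with max-old-space-size=4096, the old spaces alone could claim about 16 GB and leave nothing for MongoDB, Kurento and the OS, whereas with max-old-space-size=2048 the worst case is about 8 GB, which still leaves roughly half the machine for everything else.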

@antobinary
Member

I have set --max-old-space-size=2048 and --max_semi_space_size=128 in the packaged /usr/share/meteor/bundle/systemd_start.sh which is part of the 2.3-alpha6 release which is about to be released. I have also included it in the release notes (not yet published)

I have also set --max-old-space-size=4096 and --max_semi_space_size=128 in the packaged /usr/share/meteor/bundle/systemd_start.sh which is part of the 2.2.32 release which is to be released soon (no date in mind yet).

Also, the default cursorInterval is now 150 ms on both 2.2.x-release and 'develop'; thank you for both recommendations.

Anything else related to this issue that should be adjusted? (I intend to keep it open until both 2.2.32 and 2.3-alpha6 are out.)

@basisbit
Collaborator Author

basisbit commented Feb 6, 2021

Anything else related to this issue that should be adjusted?

Nothing that I know of. Thanks for accepting the suggested changes!

@Jossef-767

Some of my servers work fine with 4 GB RAM, and when I tested with --max-old-space-size=2048 and --max_semi_space_size=128 it doubled the amount of memory used and reached the maximum available.
I do not think it is necessary to change these settings in the installation defaults.

@basisbit
Collaborator Author

basisbit commented Feb 8, 2021

@Jossef-767 running BBB on a 4 GB RAM system is not a supported environment. It might work to some degree, but it will fail once someone uploads a fancy presentation, or when a couple of concurrent users use the service.

We have to optimize BBB for some environment, and that is the recommended minimum as described in the BBB docs. You can still run BBB on a 6 or 8 GB RAM system, and even that is quite a bit below the recommended minimal requirements. Of the 4 GB of memory you mentioned, half a GB is already used for the MongoDB ramdisk; if you reduce that to 128 MB, you can probably somewhat run BBB on your 4 GB RAM system again. Besides that, no one will stop you from simply reducing that new default NodeJS setting.

@schrd
Collaborator

schrd commented Feb 8, 2021

I have set --max-old-space-size=2048 and --max_semi_space_size=128 in the packaged /usr/share/meteor/bundle/systemd_start.sh which is part of the 2.3-alpha6 release which is about to be released. I have also included it in the release notes (not yet published)

I have also set --max-old-space-size=4096 and --max_semi_space_size=128 in the packaged /usr/share/meteor/bundle/systemd_start.sh which is part of the 2.2.32 release which is to be released soon (no date in mind yet).

I think changing an option that increases potential memory requirements during a patch-level upgrade in 2.2 is not a good idea. I'd suggest that this is

  • either set depending on the memory the machine is equipped with (for example, only if the system has >= 16 GB memory)
  • or left as it currently is by default, with an option to configure it via environment variables

Otherwise this might surprise operators whose machines are short on memory. Kurento memory consumption varies a lot depending on the number of streams. With minimum requirements of 8 GB it might be dangerous to increase the memory usage of meteor to 4 GB.

I think that increasing the default memory requirements for 2.3 would be okay.

@hex-m
Contributor

hex-m commented Feb 8, 2021

For the minimum requirements of 8 GB it might be dangerous

The documentation already specifies 16 GB as the minimum for a production setup.

@basisbit
Collaborator Author

basisbit commented Feb 8, 2021

I think changing an option that increases potential memory requirements during a patch level upgrade in 2.2 is not good. I'd suggest that this is

This increase mostly fixes the "bug" of NodeJS running out of memory (and, long before that, already spending 50% to 75% of its process CPU time on very frequent garbage collection). If this change increases memory consumption for your setup, then it was already starving previously. All the other BBB processes do not have a static upper memory limit, so imho this is an acceptable change for a "stability" patch release, which will be the only release line for a few more months until the majority of people deem 2.3.x stable enough to reinstall all their servers.

As stated above, if you operate BBB well below the minimum requirements mentioned in the documentation, then having to adjust such limits in a few rare cases should be expected.

@pbdco

pbdco commented Feb 8, 2021

Maybe the install script should check how much RAM the server has installed, and if it is more than 16 GB apply the NodeJS optimization. If not, it should either leave the default as-is or set the values according to the system RAM (in the right proportion).

@basisbit
Collaborator Author

basisbit commented Feb 8, 2021

@pbdco how about increasing the maximum limit for NodeJS to the above values, and having the bbb-install script only lower it when the server has less than 12 GB of RAM (and then also show a warning about a possibly insufficient amount of RAM)?
The goal is to have well-working defaults and to only require workarounds when the system is running way outside the expected environment. Almost everything done by bbb-install will be overwritten the next time the BBB server is updated, and each of these bbb-install tweaks has to be added to the other deploy methods such as the ansible playbooks (which by now are most likely used for the majority of BBB server installs). Thus it would be good to reduce the need for bbb-install.sh as much as possible for typical setups. It would be nice if eventually you could just do apt install bigbluebutton.
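
As a rough sketch of the kind of check bbb-install.sh (or a playbook) could do; the 12 GB threshold follows the suggestion above and the variable name is made up:

# total memory in MB, read from /proc/meminfo
total_mem_mb=$(awk '/MemTotal/ {print int($2 / 1024)}' /proc/meminfo)
if [ "$total_mem_mb" -lt 12288 ]; then
  echo "WARNING: only ${total_mem_mb} MB RAM detected; below the recommended minimum, lowering the NodeJS heap limit." >&2
  node_max_old_space=2048
else
  node_max_old_space=4096
fi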

@pbdco

pbdco commented Feb 8, 2021

That sounds OK to me! @basisbit

@antobinary
Member

We just had demo2 hit an OOM even with --max-old-space-size=4096 --max_semi_space_size=128:

Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: <--- Last few GCs --->
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: [10644:0x21b7110] 1142314303 ms: Mark-sweep 3940.8 (4429.4) -> 3940.7 (4429.4) MB, 2213.1 / 2.7 ms  allocation failure GC in old sp
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: [10644:0x21b7110] 1142317477 ms: Mark-sweep 3940.7 (4429.4) -> 3940.7 (4167.4) MB, 2846.0 / 3.1 ms  last resort GC in old space req
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: [10644:0x21b7110] 1142319667 ms: Mark-sweep 3940.7 (4167.4) -> 3940.7 (4167.4) MB, 2189.1 / 4.0 ms  last resort GC in old space req
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: <--- JS stacktrace --->
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: ==== JS stack trace =========================================
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: Security context: 0x14f07025891 <JSObject>
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:     1: /* anonymous */ [/usr/share/meteor/bundle/programs/server/packages/ejson.js:~697] [pc=0x6357c17d093](this=0x2fec79c8c2f1 <JS
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:     2: arguments adaptor frame: 3->1
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:     3: forEach(this=0x420f3dd04a1 <JSArray[27]>)
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:     4: transform [/usr/share/meteor/bundle/programs/server/packages/minimongo.js:~3101] [pc=0x6357b996c91](this=0x2fec79c8...
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  1: node::Abort() [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  2: 0x8cd49c [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  3: v8::Utils::ReportOOMFailure(char const*, bool) [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  4: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  5: v8::internal::Factory::NewCode(v8::internal::CodeDesc const&, unsigned int, v8::internal::Handle<v8::internal::Object>, bool, i
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  6: v8::internal::CodeGenerator::MakeCodeEpilogue(v8::internal::TurboAssembler*, v8::internal::EhFrameWriter*, v8::internal::Compil
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  7: v8::internal::compiler::CodeGenerator::FinalizeCode() [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  8: v8::internal::compiler::PipelineImpl::FinalizeCode() [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]:  9: v8::internal::compiler::PipelineCompilationJob::FinalizeJobImpl() [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: 10: v8::internal::Compiler::FinalizeCompilationJob(v8::internal::CompilationJob*) [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: 11: v8::internal::OptimizingCompileDispatcher::InstallOptimizedFunctions() [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: 12: v8::internal::StackGuard::HandleInterrupts() [/usr/share/node-v8.17.0-linux-x64/bin/node]
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: 13: v8::internal::Runtime_StackGuard(int, v8::internal::Object**, v8::internal::Isolate*) [/usr/share/node-v8.17.0-linux-x64/bin/no
Feb 09 14:15:08 demo2.bigbluebutton.org systemd_start.sh[9884]: 14: 0x635660842fd
Feb 09 14:15:08 demo2.bigbluebutton.org systemd[1]: bbb-html5.service: Main process exited, code=exited, status=134/n/a
Feb 09 14:15:08 demo2.bigbluebutton.org systemd[1]: bbb-html5.service: Unit entered failed state.
Feb 09 14:15:08 demo2.bigbluebutton.org systemd[1]: bbb-html5.service: Failed with result 'exit-code'.

@basisbit
Collaborator Author

basisbit commented Feb 9, 2021

@antobinary got any graphs/details on uptime, concurrent users, biggest meeting size, inbound network traffic, inbound connections per second, bbb-version? Did someone maybe test their Denial of Service tool? Do you do any nginx rate limiting?

@basisbit
Collaborator Author

°bump°

@antobinary got any graphs/details on uptime, concurrent users, biggest meeting size, inbound network traffic, inbound connections per second, bbb-version? Did someone maybe test their Denial of Service tool? Do you do any nginx rate limiting?

@pbdco

pbdco commented Feb 16, 2021

Hi @antobinary, did you include any of these improvements in the recent 2.2.32?

@ffdixon
Member

ffdixon commented Feb 16, 2021

Yes. Could you help test a prerelease of 2.2.32? See https://groups.google.com/g/bigbluebutton-dev/c/8T4SGMYm84o/m/YJLWiO5vAgAJ

@antobinary
Member

I do not have this info...

@basisbit
Collaborator Author

Closing this as fixed in 2.2.32

@basisbit
Collaborator Author

By the way, if you are using the NodeJS CPU monitor for Grafana/Prometheus, you'll probably have to update it so that it can still detect the NodeJS process. https://gitlab.senfcall.de/senfcall-public/nodejs-cpu-monitor

@iwfet

iwfet commented Apr 7, 2022

I have this same problem; I need to configure a server so that it is well optimized.
