
Split html5-server in multiple processes for larger meetings #10349

Merged
merged 1 commit on Mar 2, 2021

Conversation

amguirado73

What does this PR do?

This PR allows splitting the html5-server into multiple processes. It is inspired by PR # 8788 but takes a different approach. The main idea is to run multiple html5-server processes, bypassing the current limitation imposed by the fact that Node.js executes essentially on a single thread. This way, you can get more users per meeting.

Motivation

Currently, there is a limitation whereby a meeting can have between 100-200 users, depending on the restrictions applied to users. I would like to get more users per meeting.

More

Using an environment variable METEOR_ROLE=[backend|frontend], multiple processes can be started. There should be only one backend process, which handles all events related to MongoDB. The frontend processes, which listen only to the frontend-redis-channel Redis channel, process only certain messages. As many frontend processes as desired can be started. NGINX is used to balance the users' Meteor connections.
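
For illustration, the role split at startup can be pictured like this (a minimal sketch, assuming a node-redis v3 style client; the channel names follow the description above, but handleMessage is a hypothetical stand-in for the actual BBB handlers):

const redis = require('redis');

const ROLE = process.env.METEOR_ROLE || 'backend';
const sub = redis.createClient();

// The single backend consumes the akka-apps events and is the only
// writer to MongoDB; every frontend listens only to the fan-out channel.
const channels = ROLE === 'backend'
  ? ['from-akka-apps-redis-channel']
  : ['frontend-redis-channel'];

channels.forEach((channel) => sub.subscribe(channel));
sub.on('message', (channel, message) => {
  handleMessage(ROLE, channel, JSON.parse(message)); // hypothetical dispatcher
});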

Examples of use in the file /usr/share/meteor/bundle/systemd_start.sh:

# For 2 FRONTEND processes and 1 BACKEND
/usr/bin/npx concurrently -n 'backend,frontend1,frontend2' "env METEOR_ROLE=backend PORT=3000 /usr/share/$NODE_VERSION/bin/node main.js" "env METEOR_ROLE=frontend PORT=3001 /usr/share/$NODE_VERSION/bin/node main.js" "env METEOR_ROLE=frontend PORT=3002 /usr/share/$NODE_VERSION/bin/node main.js"

# For 3 FRONTEND processes and 1 BACKEND
/usr/bin/npx concurrently -n 'backend,frontend1,frontend2,frontend3' "env METEOR_ROLE=backend PORT=3000 /usr/share/$NODE_VERSION/bin/node main.js" "env METEOR_ROLE=frontend PORT=3001 /usr/share/$NODE_VERSION/bin/node main.js" "env METEOR_ROLE=frontend PORT=3002 /usr/share/$NODE_VERSION/bin/node main.js" "env METEOR_ROLE=frontend PORT=3003 /usr/share/$NODE_VERSION/bin/node main.js"

The configuration to apply in NGINX is:

File: /etc/nginx/nginx.conf

upstream poolhtml5servers {
            zone poolhtml5servers 32k;
            hash $remote_addr;
            server 127.0.0.1:3001 fail_timeout=5s max_fails=3;
            server 127.0.0.1:3002 fail_timeout=5s max_fails=3;
    }

File: /etc/bigbluebutton/nginx/bbb-html5.nginx

location /html5client {
  proxy_pass http://poolhtml5servers;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "Upgrade";
}

The hash $remote_addr directive is used to ensure that each user is always routed to the same html5-server process.

As a side effect, a problem was observed with external videos when the presenter starts/stops or repositions the video. To address this, the external-video code has been modified so that these events generate messages on Redis channels and can thus be received by all running frontend processes. I have reused PR#7484 for this.
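
The fan-out works because every frontend subscribes to the same Redis channel, so each process can re-emit the event to the clients it serves. A minimal sketch, assuming redisPub/redisSub are connected node-redis v3 clients; the message name and the externalVideoStreamer helper are illustrative, not the exact identifiers in the PR:

// The presenter's frontend publishes the player event for all processes.
redisPub.publish('frontend-redis-channel', JSON.stringify({
  name: 'UpdateExternalVideoEvtMsg',
  meetingId,
  body: { status: 'play', rate: 1, time: 123.4 },
}));

// Every frontend (including the publisher) receives the event and
// re-emits it to its own connected clients.
redisSub.on('message', (channel, raw) => {
  const msg = JSON.parse(raw);
  if (msg.name === 'UpdateExternalVideoEvtMsg') {
    externalVideoStreamer(msg.meetingId).emit('playerUpdate', msg.body);
  }
});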

Also, the banned-users logic has been modified to use MongoDB instead of a Set local to each process.
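
The reason is that a per-process Set is invisible to the other frontends, so a ban applied by one process would not take effect on the others; a shared MongoDB collection fixes that. A minimal sketch in Meteor server code (the collection and helper names are illustrative, not the actual BBB ones):

import { Mongo } from 'meteor/mongo';

// Every process connects to the same MongoDB, so the ban list is shared.
const BannedUsers = new Mongo.Collection('users-banned');

export function ban(meetingId, externalUserId) {
  BannedUsers.upsert({ meetingId, externalUserId }, { $set: { meetingId, externalUserId } });
}

export function isBanned(meetingId, externalUserId) {
  return !!BannedUsers.findOne({ meetingId, externalUserId });
}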

 Changes to be committed:
	new file:   akka-bbb-apps/src/main/scala/org/bigbluebutton/core/apps/externalvideo/ExternalVideoApp2x.scala
	new file:   akka-bbb-apps/src/main/scala/org/bigbluebutton/core/apps/externalvideo/StartExternalVideoPubMsgHdlr.scala
	new file:   akka-bbb-apps/src/main/scala/org/bigbluebutton/core/apps/externalvideo/StopExternalVideoPubMsgHdlr.scala
	new file:   akka-bbb-apps/src/main/scala/org/bigbluebutton/core/apps/externalvideo/UpdateExternalVideoPubMsgHdlr.scala
	modified:   akka-bbb-apps/src/main/scala/org/bigbluebutton/core/pubsub/senders/ReceivedJsonMsgHandlerActor.scala
	modified:   akka-bbb-apps/src/main/scala/org/bigbluebutton/core/running/MeetingActor.scala
	modified:   akka-bbb-apps/src/main/scala/org/bigbluebutton/core2/FromAkkaAppsMsgSenderActor.scala
	new file:   bbb-common-message/src/main/scala/org/bigbluebutton/common2/msgs/ExternalVideoMsgs.scala
	new file:   bigbluebutton-html5/imports/api/external-videos/server/eventHandlers.js
	new file:   bigbluebutton-html5/imports/api/external-videos/server/handlers/startExternalVideo.js
	new file:   bigbluebutton-html5/imports/api/external-videos/server/handlers/stopExternalVideo.js
	new file:   bigbluebutton-html5/imports/api/external-videos/server/handlers/updateExternalVideo.js
	modified:   bigbluebutton-html5/imports/api/external-videos/server/index.js
	modified:   bigbluebutton-html5/imports/api/external-videos/server/methods.js
	modified:   bigbluebutton-html5/imports/api/external-videos/server/methods/emitExternalVideoEvent.js
	modified:   bigbluebutton-html5/imports/api/external-videos/server/methods/startWatchingExternalVideo.js
	modified:   bigbluebutton-html5/imports/api/external-videos/server/methods/stopWatchingExternalVideo.js
	new file:   bigbluebutton-html5/imports/api/external-videos/server/streamer.js
	modified:   bigbluebutton-html5/imports/api/meetings/server/handlers/meetingDestruction.js
	modified:   bigbluebutton-html5/imports/api/meetings/server/modifiers/addMeeting.js
	modified:   bigbluebutton-html5/imports/api/meetings/server/modifiers/meetingHasEnded.js
	modified:   bigbluebutton-html5/imports/api/users/server/handlers/validateAuthToken.js
	modified:   bigbluebutton-html5/imports/api/users/server/store/bannedUsers.js
	modified:   bigbluebutton-html5/imports/startup/server/index.js
	modified:   bigbluebutton-html5/imports/startup/server/redis.js
	modified:   bigbluebutton-html5/imports/ui/components/external-video-player/service.js
	modified:   bigbluebutton-html5/private/config/settings.yml
@amguirado73 changed the title from "Committer: Antonio Guirado <amguirado73@gmail.com>" to "Split html5-server in multiple processes for larger meetings" on Aug 28, 2020
@antobinary added this to the Release 2.3 milestone on Aug 28, 2020
@antobinary
Member

Hi @amguirado73! Thank you for your contribution! Could you please confirm you have filled out a CLA? https://docs.bigbluebutton.org/support/faq.html#why-do-i-need-to-sign-a-contributor-license-agreement-to-contribute-source-code

@amguirado73
Author

amguirado73 commented Aug 29, 2020 via email

@ffdixon
Member

ffdixon commented Sep 12, 2020

Contributor agreement received -- thanks!

@jibon57
Contributor

jibon57 commented Sep 20, 2020

@amguirado73 thanks a lot for introducing this new PR, and for the email conversation. We've tested your PR on release builds 2.2.23 & 2.2.25, with 2,000 real users, using the following machines:

2 X E5-2680 v2 @ 2.80GHz (total 40 cores)
62GB RAM
1Gbps uplink
&
2 X E5-2650 @ 2.00GHz (total 32 cores)
62GB RAM
1Gbps uplink

We've configured a pool of 5 html5 processes & tested for over 1 week, averaging 30~70 users in each session. We're really happy as all tests went well & we made the following observations:

  1. Each server was able to hold 800+ users without any problem, in some cases 1100+ users, & we didn't notice much of a problem. During this time there were very few webcams. Most of the CPU was used by FreeSWITCH (1200%); Node.js averaged 40%.

  2. Each server was able to hold 480+ users & 110+ webcams. Beyond that, users in new meetings had problems sharing webcams, getting Error: 2003. During testing we split Kurento into 3 parts, which was introduced in v2.2.24. We also decreased the bitrate to 50 kbit/s. Pagination was on, with Moderator: 10 & Attendee: 5. Users didn't have problems joining with audio, only with webcams. At this point Kurento CPU usage reached 700%.

Our target was to hold 600 users on each server & we're happy to find it working. I hope the BBB core team will have a look at this & merge/further improve it. Thanks again to @amguirado73 for all your help during the tests.

@ffdixon
Member

ffdixon commented Sep 20, 2020

Thanks @jibon57 for sharing your experience! This work (or a variation of it) will be merged into BigBlueButton 2.3-dev.

@iSamof

iSamof commented Sep 20, 2020

@jibon57 - thank you very much for this valuable live test and sharing the results. One question:

When you say "Both server was able to hold 800+ users ...", do you mean that both servers together had a total of 800+ users, or that each server managed to service 800+, for a total of 1600+ between the two servers?

Thank you again

@jibon57
Contributor

jibon57 commented Sep 21, 2020

@ffdixon thank you!
@iSamof each of the servers was able to hold 800+ users; we tested up to 1100+ on each.

@aguerson

Hi guys,
You are on the right track ;)

I have 4 questions:

  1. With your pool of 5 html5 processes, is one meeting spread across all 5, or does a meeting always stay on 1 of the 5?
  2. Could you raise the pool from 5 processes up to 10 or more?
  3. When you said you decreased the bitrate to 50 kbit/s, where did you do this?
  4. Maybe I missed something in the docs: when you said "pagination was on with Moderator: 10 & Attendee: 5", I activated pagination in 2.2.25; is this not automatic? Do you have to push a button during the meeting, or activate an option on the meeting?

Regards,
Aurélien.

@aguerson

And thank you again !

@amguirado73
Author

amguirado73 commented Sep 21, 2020 via email

@aguerson

@amguirado73
Very good news!
When will this PR be validated as usable with 2.2.25+? Do you have more documentation for configuring nginx? I saw the doc in the PR description; is that enough to play with it?

@ffdixon
Do you intend to cover this in the official BBB docs, or is it too fresh?

Without this PR and without the 3-KMS PR, I managed to accept 450 people with 2 cams in one meeting! (But at the end the chat didn't respond... and the meeting froze.)

With this server (OpenVZ 7 installed, CTs running Ubuntu 16.04 LTS):
Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz (128 procs)
187 GB RAM
10G uplink (but we didn't go above 1G)

I already updated to 2.2.25 to use the 3-KMS PR, and I hope that with this PR I can finally take more than 500 users in one session.

@aguerson

I managed to accept the 450 people on 2.2.23.

@mabras

mabras commented Sep 21, 2020

Thanks @amguirado73

I am just confused about:

Currently, there is a limitation whereby a meeting can have between 100-200 users, depending on the restrictions applied to users. I would like to get more users per meeting.

A possible sizing rule is 100-150 users per process, similar to the current limit.

The motivation was to get more; if we are still at the same limit, then what is the point? And if I am using Scalelite, does this add any value?

@amguirado73
Author

amguirado73 commented Sep 21, 2020 via email

@aguerson

aguerson commented Sep 21, 2020

How can you generate 100 bots? Do you have a script, and could you share it?

I want to use it to load my server with as many bot users as I can.

I can give you feedback afterwards.

@mabras

mabras commented Sep 21, 2020

About whether we are still at the same limit per meeting: I don't have enough resources to run tests with that number of users (I can generate only 100 bots). I would appreciate any kind of help with it.

Just yesterday I was using:
https://github.com/mconf/bigbluebot
I was able to generate 250 bots with it. Five servers, 50 bots each.

@amguirado73
Author

amguirado73 commented Sep 21, 2020 via email

@aguerson

aguerson commented Sep 22, 2020

I am still trying to get bigbluebot running. I opened an issue about using it:

mconf/bigbluebot#11

@iSamof

iSamof commented Sep 23, 2020

@jibon57 - Many thanks for the answers.

Thanks to everyone else for sharing the info.

@jibon57
Contributor

jibon57 commented Sep 24, 2020

Hello guys,

I was checking PM2 cluster mode from here: https://pm2.keymetrics.io/docs/usage/cluster-mode/
Since in this solution the backend & frontend are separated, using this tool may be more helpful than nginx load balancing; PM2 also has a better monitoring system. I did a simple test; I'm not sure it was the correct way to do it, but it was working. As you're testing with bots, you can give it a try & see the difference. I've installed pm2 globally with:
npm install pm2@latest -g

Now edit /usr/share/meteor/bundle/systemd_start.sh:

#PORT=3000 /usr/share/$NODE_VERSION/bin/node main.js
env METEOR_ROLE=backend PORT=3000 pm2 start main.js --restart-delay=3000 --name backend
env METEOR_ROLE=frontend PORT=3001 pm2 start main.js -i 5 --restart-delay=3000 --name frontend

Nginx config: /etc/bigbluebutton/nginx/bbb-html5.nginx:

location /html5client {
  proxy_pass http://127.0.0.1:3001;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "Upgrade";
}

Now start the processes:
bash /usr/share/meteor/bundle/systemd_start.sh
It should start the processes, which you can then monitor with:
pm2 monit
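
Equivalently, the two pm2 commands above can be captured in an ecosystem file (a sketch, assuming pm2's standard ecosystem format and the bundle path used in this thread):

// ecosystem.config.js (sketch); start with: pm2 start ecosystem.config.js
module.exports = {
  apps: [
    {
      name: 'backend',
      script: '/usr/share/meteor/bundle/main.js',
      env: { METEOR_ROLE: 'backend', PORT: 3000 },
      restart_delay: 3000,
    },
    {
      name: 'frontend',
      script: '/usr/share/meteor/bundle/main.js',
      instances: 5,
      exec_mode: 'cluster', // pm2 cluster mode shares PORT 3001 between instances
      env: { METEOR_ROLE: 'frontend', PORT: 3001 },
      restart_delay: 3000,
    },
  ],
};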

@aguerson

In my case, I am still trying to find enough resources to run the 500+ bots.
Then I will try the first solution.
@jibon57 In your setup, did you apply the original commit first and then your patch, or just your patch?

@ffdixon mentioned this pull request on Sep 26, 2020
@GhaziTriki
Member

Hi @jibon57
Thank you very much for your pull request. I have checked it, and it seems good if you want more concurrent users on the same server; however, it does not seem to solve the other problem, namely having more than 100 concurrent users per meeting.

Your servers have a lot of RAM; if you use PM2 to run a cluster, you will use a lot of RAM too, and I think that is what is happening on your servers. The same goes for CPU: you are using a lot of it.

@jibon57
Contributor

jibon57 commented Nov 5, 2020

Thanks @GhaziTriki. The PR is actually from @amguirado73. In my case I didn't need capacity for more than 100 users per room, so I didn't run that test, but @amguirado73 did test with over 320 users. So far this solution is working fine for my case.

@GhaziTriki
Member

@amguirado73 How much effort would it be to create a similar PR for 2.2.x? I am interested in testing it.

@netzwerk-azv

@amguirado73 How much effort would it be to create a similar PR for 2.2.x? I am interested in testing it.

@netzwerk-azv

Will this be included in one of the 2.2.xx releases, or only in 2.3? I've only read that someone managed to test this on 2.2.23 and 2.2.25, but I am sure this would help a lot of users who, right now, face problems getting enough users onto their servers and meetings within the limited resources available.
And we don't know when a stable 2.3 will be released.

I am currently on 2.2.29 (dev)

@schrd
Collaborator

schrd commented Nov 23, 2020

We have been running this patch, adapted for 2.2.28, in production on 12 servers for a week, with 2 frontend processes each. It works very well so far. We decided to give this patch a try because the scalelite balancing strategy is so stupid: we have a very unequal distribution of conference sizes, and two weeks ago scalelite decided to balance two conferences with 150 participants each onto a server that was already loaded with 100 users. Of course, the inevitable kicking of users happened. In a load test with 100 desktop computers running 5 bigbluebots each, this patch performed very well; we got 475 bots joined into 4 conferences.

However, today one nodejs frontend process crashed with:

Nov 23 13:56:16 bbb-server systemd_start_frontend.sh[1151]: terminate called after throwing an instance of 'std::bad_alloc'
Nov 23 13:56:16 bbb-server systemd_start_frontend.sh[1151]:   what():  std::bad_alloc
Nov 23 13:56:16 bbb-server systemd_start_frontend.sh[1151]: /usr/share/meteor/bundle/systemd_start_frontend.sh: line 66:  2066 Aborted                 PORT=$1 /usr/share/$NODE_VERSION/bin/node main.js

I assume this is a nodejs bug, not related to the patch; I believe I saw this message some time ago without the patch as well. Memory was not short on the system. Existing users on the server were immediately rebalanced to the other frontend process by nginx, and I doubt anyone noticed the error. However, when the process was restarted by systemd and nginx balanced connections to it, I discovered the following messages in the log:

Nov 23 13:56:42 bbb-server systemd_start_frontend.sh[8892]: error: Error while trying to send cursor streamer data for meeting xxxxxxx-yyy. TypeError: Cannot read property 'emit' of undefined

I could reproduce this on our test server by restarting one frontend process. The cause is that cursor positions and annotations in the presentations are distributed between the users by the meteor-streamer library. The meteor-streamer subscriptions are started by addMeeting events; the freshly started process, however, never received those events for the meetings already running. Distribution of annotations and cursor positions worked for fresh conferences.

I would like to contribute a patch that fixes this behaviour; however, I do not understand why this meteor-streamer mechanism works, and why the replication of annotation state between the two meteor processes works at all. To me, meteor-streamer looks like it should broadcast messages to all users in a conference, but only within the same nodejs process.

Can anybody give me a hint where to look, and how to improve the patch so that it continues to work when meteor is restarted? My current idea is to query mongodb at server startup for running meetings and register those meetings for streaming of cursors and annotations.
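
For reference, the proposed startup re-registration could look roughly like this (a sketch of the idea only, not a tested fix; addStreamerForMeeting is a hypothetical stand-in for whatever the addMeeting handler does today):

import { Meteor } from 'meteor/meteor';
import Meetings from '/imports/api/meetings';

Meteor.startup(() => {
  // Re-register cursor/annotation streamers for meetings that were
  // created before this frontend process started.
  Meetings.find({}).forEach((meeting) => {
    addStreamerForMeeting(meeting.meetingId); // hypothetical helper
  });
});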

@github-actions

github-actions bot commented Dec 4, 2020

This pull request has conflicts ☹
Please resolve those so we can review the pull request.
Thanks.

@ichdasich

With 2.3 taking its time, is there any chance to maybe get this into one of the coming 2.2 releases?

@ffdixon
Member

ffdixon commented Dec 9, 2020

There is work underway to get a variation of this approach into the next build of BigBlueButton 2.3-dev.

@ichdasich

So, not planned for 2.2? Guess I will have to wait for the 2.3 release then.

@ffdixon
Member

ffdixon commented Dec 9, 2020

It's easier for us to implement and test this in 2.3-dev. We're weaning off 2.2 soon, as we want to focus our efforts on the next release and get the product onto Ubuntu 18.04 asap.

@cod3r0k

cod3r0k commented Dec 10, 2020

@ffdixon Can we migrate from 2.2 to 2.3, or can't we? (I'm worried about my previous recordings.)

@ffdixon
Member

ffdixon commented Dec 10, 2020

We would recommend setting up 2.3-dev on a new 18.04 server, leaving your 16.04 server untouched, and then copying the recordings from the 2.2 server onto the 2.3-dev server. This way, nothing changes on the 2.2 server and you can test the 2.3-dev server independently.

@schrd
Collaborator

schrd commented Jan 4, 2021

(Quoting my comment from Nov 23 above, about running this patch on 12 production servers and the std::bad_alloc crash of one nodejs frontend process.)

We have now been running this patch for 7 weeks, with more than 32k meeting hours and more than 200k participant hours (a meeting with 10 users for one hour counts as 1 meeting hour and 10 participant hours). We ported it to 2.2.30 in the meantime. In total we had 3 nodejs crashes across all of our servers since then, so I consider this safe. At least I have never had to struggle with overloaded nodejs processes since then. It also reduces the latency of actions such as mute/unmute on servers with more than 200 concurrent users.

Compared to #11008, this approach is superior in my opinion:

  • You can have more participants in a meeting than a single nodejs instance can handle. This would allow running BBB on servers with many but slower cores (such as ARM servers).
  • No meeting-routing logic in bbb-web is required, which reduces complexity.
  • If one meteor instance crashes for whatever reason, nginx will rebalance the users to the other running frontend; the user will only notice the reconnection notification when meteor reestablishes its websocket. Contrast this with the current situation without the patch: if nodejs crashes, all meetings on that server are destroyed.
  • The frontend processes can be set up with the systemd options RestartSec=5s and Restart=on-failure (we run it this way; see the sketch after this list).
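
A minimal sketch of such a templated unit, with the port as the instance parameter (paths and option values are illustrative; the real unit is linked later in this thread):

# /etc/systemd/system/bbb-html5-frontend@.service (illustrative)
[Unit]
Description=BigBlueButton HTML5 frontend on port %i
After=network.target

[Service]
Environment=METEOR_ROLE=frontend
ExecStart=/usr/share/meteor/bundle/systemd_start_frontend.sh %i
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Instances would then be started as, e.g., systemctl start bbb-html5-frontend@3001 bbb-html5-frontend@3002.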

I set up the nginx load balancer to use the number of connections (least_conn) to decide which meteor process to route to. We currently run it with 2 workers.

@ichdasich

ichdasich commented Jan 4, 2021

Do you maybe have documentation for applying the patch to 2.2.30?

Also: Does applying the patch require repackaging BBB, and are there any changes in 2.2.30 which require adjustments to the patch?

@schrd
Collaborator

schrd commented Jan 5, 2021

Do you maybe have documentation for applying the patch to 2.2.30?

Also: Does applying the patch require repackaging BBB, and are there any changes in 2.2.30 which require adjustments to the patch?

You need to rebuild the bbb-html5 and bbb-apps-akka packages. In addition, a new systemd unit, bbb-html5-frontend@.service, is necessary to serve the user requests. Nginx needs a load-balancer config:

# /etc/nginx/conf.d/bbb-html5-loadbalancer.conf 
upstream poolhtml5servers {
  zone poolhtml5servers 32k;
  least_conn;
  server 127.0.0.1:3001 fail_timeout=5s max_fails=3;
  server 127.0.0.1:3002 fail_timeout=5s max_fails=3;
}
# /etc/bigbluebutton/nginx/bbb-html5.nginx
location /html5client {
  proxy_pass http://poolhtml5servers;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "Upgrade";
}

location /_timesync {
  proxy_pass http://127.0.0.1:3000;
}

The patched source tree is here: https://gitlab.hrz.tu-chemnitz.de/bigbluebutton/bigbluebutton/-/tree/pr-10349-2.2.30
To create packages I created https://gitlab.hrz.tu-chemnitz.de/bigbluebutton/bigbluebutton-packaging
I did not yet patch bbb-conf because we start/stop services using our configuration management system.

The systemd unit is included in https://gitlab.hrz.tu-chemnitz.de/bigbluebutton/bigbluebutton-packaging/-/blob/master/bbb-html5/bbb-html5-frontend@.service

@antobinary
Member

I am looking at the nginx load-balancing strategies, trying to understand more about how frontend f1 is preferred over frontend f2 when a new user joins. I understand that the hash approach ensures that if a user refreshes, the same frontend is used as for the previous connection.
Could anyone comment on this?
Was the 'Least Connections' approach considered?

@schrd
Collaborator

schrd commented Feb 10, 2021

We are running it with least_conn as the balancing strategy. This way the load on the frontends stays almost equal.

@jibon57
Contributor

jibon57 commented Feb 11, 2021

Previously I always used hash $remote_addr; but for the last 2 days I tried least_conn. However, I didn't notice much difference.

I also tried pm2. Using pm2 I got a very good result too.

@antobinary
Member

2.3-alpha7 was just released, and it includes #11317, which is based on this work by @amguirado73 and @jfsiebel: https://github.com/bigbluebutton/bigbluebutton/releases/tag/v2.3-alpha-7

Please give it a try if you have the opportunity to put some load on a 2.3-alpha7 server (not in production).
