
Running on docker with prometheus open keep eating up memory #666

Closed

yuanchieh-cheng opened this issue Dec 16, 2020 · 20 comments

@yuanchieh-cheng

It seems that Prometheus keeps all relay connection records in memory, so CPU and memory usage keep increasing.
After rebooting the server, CPU / memory usage drops back down.

Is there any way to clear the memory without restarting the server?
It's OK to clear all the records.

@yuanchieh-cheng yuanchieh-cheng changed the title from "Running on docker with prometheus keep eating up memory" to "Running on docker with prometheus open keep eating up memory" on Dec 16, 2020
@domeger

domeger commented Dec 16, 2020

@yuanchieh-cheng how did you enable the endpoint? Via the config?

@yuanchieh-cheng
Author

yuanchieh-cheng commented Dec 16, 2020

Via the config.

Can I switch it via an endpoint instead? I didn't see that in the documentation.
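For reference, enabling it via the config looks roughly like the snippet below. This reflects my understanding of the option on the master branch at the time; the exporter's default port may differ between builds, so verify against your version.

```
# turnserver.conf (sketch)
# Enable the Prometheus metrics exporter; it serves /metrics on its own
# listener (port 9641 by default, as far as I know).
prometheus
```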

@misi
Contributor

misi commented Dec 16, 2020

@yuanchieh-cheng Can you please find the root cause, fix it, and send a pull request?
I'm working on other issues and won't find time to look into it in the short term...
Many thanks in advance for your help.

@misi misi added the bug label Dec 16, 2020
@misi misi added this to the next build milestone Dec 16, 2020
@yuanchieh-cheng
Author

yuanchieh-cheng commented Dec 16, 2020

I am not a C programmer, but I will try my best.
My proposal is to flush all the data whenever the /metrics API is called successfully; otherwise it would keep collecting data as it does now.
Is it OK to do so?
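A minimal, self-contained sketch of that idea (this is not coturn's actual prom_server.c; the table, names, and handler below are made up purely to illustrate "render, then reset on success"):

```c
#include <stdio.h>
#include <string.h>

/* Illustrative per-allocation counter table; in the real server this would be
 * whatever the exporter currently accumulates per relay connection. */
#define MAX_ALLOCATIONS 1024

struct alloc_metric {
    char id[64];                 /* allocation identifier used as a label */
    unsigned long sent_bytes;
    unsigned long recv_bytes;
};

static struct alloc_metric table[MAX_ALLOCATIONS];
static size_t table_len;

/* Write the current counters in Prometheus exposition format. */
static int render_metrics(FILE *out)
{
    if (fprintf(out, "# TYPE turn_allocation_traffic_bytes counter\n") < 0)
        return -1;
    for (size_t i = 0; i < table_len; i++) {
        if (fprintf(out,
                    "turn_allocation_traffic_bytes{allocation=\"%s\",direction=\"sent\"} %lu\n"
                    "turn_allocation_traffic_bytes{allocation=\"%s\",direction=\"received\"} %lu\n",
                    table[i].id, table[i].sent_bytes,
                    table[i].id, table[i].recv_bytes) < 0)
            return -1;
    }
    return 0;
}

/* The proposed behaviour: drop the accumulated records only after the
 * /metrics response was written successfully, so nothing is lost on a failed
 * scrape, but the table cannot grow without bound between scrapes. */
static void handle_metrics_scrape(FILE *out)
{
    if (render_metrics(out) == 0) {
        memset(table, 0, sizeof(table));
        table_len = 0;
    }
}

int main(void)
{
    /* Simulate two allocations recorded between scrapes. */
    table[table_len++] = (struct alloc_metric){ "alloc-1", 10240, 8192 };
    table[table_len++] = (struct alloc_metric){ "alloc-2", 512, 480 };

    handle_metrics_scrape(stdout);                     /* prints both, then flushes */
    printf("records after scrape: %zu\n", table_len);  /* 0 */
    return 0;
}
```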

@misi
Contributor

misi commented Dec 16, 2020

@wolmi Can you please help with fixing this issue?
See: #517

@misi
Contributor

misi commented Dec 16, 2020

@yuanchieh-cheng I don't have a Prometheus setup, so I don't yet fully see what data is stuck in memory.
Can you dig deeper into a memory dump and give more info about which data blocks are stuck in memory and eating it up?

@yuanchieh-cheng
Author

OK, I will try to produce the debug data.

@wolmi
Contributor

wolmi commented Dec 16, 2020

I think it is related to the huge number of allocations and the fact that the allocation is used as a label, which causes each metric to carry a lot of information.

Is the use case a test environment or production? Can you share some info about the number of users?

@yuanchieh-cheng
Author

yuanchieh-cheng commented Dec 17, 2020

Yes, I think that is the root cause of the issue.
It is under testing, but I think the issue is environment-independent.
I run a cron job at a 1-minute rate that starts a TCP relay as a health check.
Roughly one day later (86400 records generated), it had eaten up roughly 200 MB.
Even worse, I found the container threw a segmentation fault (core dump) after a day, with only the health check cron job and no other users.
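For a rough sense of scale, 200 MB across 86400 records works out to about 2.4 KB per recorded allocation, which is plausible if each one keeps its own label strings and counter values resident in the exporter.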

@deyceg

deyceg commented Dec 23, 2020

I think it is related to the huge number of allocations and the fact that the allocation is used as a label, which causes each metric to carry a lot of information.

Is the use case a test environment or production? Can you share some info about the number of users?

@wolmi We came to the same conclusion. Allocation ID is a very bad, high-cardinality label. It should be removed, otherwise the exporter is not usable.

More information can be found here
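To make the cardinality problem concrete, a scrape with a per-allocation label looks roughly like this (the metric and label names here are illustrative, not coturn's exact ones); every new allocation adds brand-new time series that Prometheus has to index and keep:

```
# TYPE turn_traffic_bytes counter
turn_traffic_bytes{allocation="1608112345:abc123",direction="sent"} 10240
turn_traffic_bytes{allocation="1608112345:abc123",direction="received"} 8192
turn_traffic_bytes{allocation="1608112407:def456",direction="sent"} 512
turn_traffic_bytes{allocation="1608112407:def456",direction="received"} 480
# ...two more series for every allocation that has ever existed
```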

@wolmi
Contributor

wolmi commented Dec 23, 2020

I ran into a similar issue with my first exporter, where I created a basic process that subscribes to the Redis stats and generates the Prometheus /metrics endpoint. I finally had to remove the allocation from the labels: first to fix queries that were making Prometheus crash, but also because it was useless, since you cannot track a user's connections using only the allocation, especially when you have multiple instances and users are automatically balanced between them.

When I decided to add the exporter directly into the coturn service, my main goal was to have exactly the same metrics and output as the Redis one, to keep it somewhat compatible. But now, if the allocations are a problem, I can propose two different approaches:

  • remove it completely from the metrics.
  • make it optional, disabled by default, adding a new parameter to allow having the allocation label if someone needs it.

@deyceg

deyceg commented Dec 23, 2020

I ran into a similar issue with my first exporter, where I created a basic process that subscribes to the Redis stats and generates the Prometheus /metrics endpoint. I finally had to remove the allocation from the labels: first to fix queries that were making Prometheus crash, but also because it was useless, since you cannot track a user's connections using only the allocation, especially when you have multiple instances and users are automatically balanced between them.

When I decided to add the exporter directly into the coturn service, my main goal was to have exactly the same metrics and output as the Redis one, to keep it somewhat compatible. But now, if the allocations are a problem, I can propose two different approaches:

* remove it completely from the metrics.

* make it optional, disabled by default, adding a new parameter to allow having the allocation label if someone needs it.

My preference is to simply remove it. If you want to do analysis per user/session, then storing the allocations in redis/mysql/mongodb/some other storage backend is a better tool for the job. Let's not conflate logging and metrics.

There is no solution here that doesn't have a massive impact on Prometheus. It basically killed our monitoring stack with just a handful of users because of trickle ICE (a single WebRTC call for a single user created 15 separate sessions!). Now multiply this by a few hundred conference rooms and a few thousand users :).

[attached screenshot: metrics]

I'd be more interested in seeing connection statistics and their state in Prometheus, as well as bandwidth usage per server rather than per session, and likewise allocations broken down by transport type. These are all good examples of low-cardinality labels.
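A sketch of what those low-cardinality equivalents could look like on a scrape (again, the names are only illustrative):

```
# TYPE turn_allocations gauge
turn_allocations{transport="udp"} 312
turn_allocations{transport="tcp"} 45
# TYPE turn_server_traffic_bytes counter
turn_server_traffic_bytes{direction="sent"} 90812345
turn_server_traffic_bytes{direction="received"} 88123901
```

The label values here (transport, direction) come from small, fixed sets, so the number of series stays constant no matter how many allocations come and go.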

@sj82516

sj82516 commented Dec 23, 2020

I want to echo @deyceg's comment.

Per-user data is important for me to do some budget control.
Storing user session data in a database is a good idea because I could better estimate the storage size and manage storage expansion.

For Prometheus metrics, overall system metrics like the current session count / current bandwidth are quite enough.

@wolmi
Contributor

wolmi commented Dec 23, 2020

OK, it seems it's time to make a decision.

@misi are you OK with removing the allocation label?

If yes, I'll try to prepare a PR. I cannot promise it in a short time because of the holidays, but it's easy to modify now.

@sj82516

sj82516 commented Jan 6, 2021

I wrote a TURN monitor and it works pretty well in our production environment.
If anyone needs statistics right now, you could consider my solution: https://github.com/sj82516/turn-monitor.
Any suggestions are welcome.

@misi misi modified the milestones: next build, 4.5.2 Jan 7, 2021
@misi
Contributor

misi commented Jan 9, 2021

@wolmi I am OK with removing the allocation.

@misi
Contributor

misi commented Jan 10, 2021

I did what I could to simplify it and include it in 4.5.2, but it is still a beta, not for production!

I will keep this issue open as a reminder to add more metrics in the 4.5.3++ releases.
@wolmi if you have time, please send a PR about it.

Thanks to all for your contributions!

@misi misi modified the milestones: 4.5.2, next build, 4.5.3 Jan 10, 2021
@wolmi
Contributor

wolmi commented Feb 23, 2021

@misi I've just checked: you have already refactored https://github.com/coturn/coturn/blob/master/src/apps/relay/prom_server.c and removed the allocation.

Do you need me to make any more changes?

@nazar-pc

Is someone planning to work on this before 4.5.3 is released?
This was expected to happen (#517 (comment)) and usable Prometheus support would be a huge benefit for production use.

@dsmeytis
Contributor

Hello @misi, @wolmi. In our project we use the REST API for authentication and Prometheus for metrics gathering. This leads to the same memory-growth issue, because every new allocation actually creates a new user with the name "timestamp:userid". I would like to propose making per-user metrics optional. Also, I believe it may be useful to have a metric of the current number of allocations even without any details, e.g. for draining an instance before it is terminated by an auto-scaling group. I can prepare PRs if you are interested.
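The draining idea only needs a single low-cardinality gauge per instance, something like the following (the metric name is hypothetical, not an existing coturn metric); an autoscaling hook could then wait until it reports 0 before terminating the instance:

```
# HELP turn_current_allocations Number of allocations currently active on this instance.
# TYPE turn_current_allocations gauge
turn_current_allocations 42
```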
