Running on docker with prometheus enabled keeps eating up memory #666
Comments
@yuanchieh-cheng how did you enable the endpoint? Via the config?
Via the config. Can I switch it by endpoint?
@yuanchieh-cheng Can you please find the root cause, fix it, and send a patch Pull Request?
I am not a C programmer, but I will try my best.
@yuanchieh-cheng I don't have a Prometheus setup, so I don't yet fully see what data is stuck in memory.
OK, I will try to produce the debug data.
I think it is related to the huge number of allocations and the fact that the allocation is used as a label, which attaches a lot of information to each metric. Is the use case a test environment, or is it production? Can you share some info about the number of users?
Yes, I think that is the root cause of the issue.
@wolmi We came to the same conclusion. The allocation ID is a very bad, high-cardinality label. It should be removed, otherwise the exporter is not usable. More information can be found here
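To see why a per-allocation label is fatal, here is a minimal model of how a Prometheus exporter stores one time series per unique label set (this is a sketch, not coturn's actual code; the metric name `turn_traffic_bytes` is illustrative):

```python
# Model of an exporter's internal series map:
# (metric_name, sorted label set) -> current value.
series = {}

def inc(metric, labels, amount=1):
    key = (metric, tuple(sorted(labels.items())))
    series[key] = series.get(key, 0) + amount

# With an allocation-id label, every allocation mints a brand-new
# series that stays in memory for the life of the process.
for alloc_id in range(10_000):
    inc('turn_traffic_bytes', {'allocation_id': str(alloc_id)}, 1500)

print(len(series))  # 10000 distinct series for a single metric
```

Nothing ever removes entries from the map, so memory grows monotonically with the total number of allocations ever seen, not with the number currently active.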
I've run into a similar issue on my first exporter, where I created a basic process that subscribes to the Redis stats and generates the Prometheus /metrics endpoint. I finally had to remove the allocation from the labels: first to fix queries that made Prometheus crash, but also because it was useless, since the allocation alone cannot track a user's connections, especially when you have multiple instances and users are automatically balanced between them. When I decided to add the exporter directly into the coturn service, my main goal was to produce the exact metrics and output of the Redis exporter to stay somewhat compatible, but now that the allocations are a problem I can propose two different approaches:
My preference is to simply remove it. If you want to do analysis per user/session, then storing the allocations in Redis/MySQL/MongoDB/some other storage backend is a better tool for the job. Let's not conflate logging and metrics. There is no solution here that doesn't have a massive impact on Prometheus. It basically killed our monitoring stack with just a handful of users because of trickle ICE (a single WebRTC call for a single user created 15 separate sessions!). Now multiply this by a few hundred conference rooms and a few thousand users :). I'd be more interested in seeing connection statistics and their state in Prometheus, as well as bandwidth usage per server rather than per session, and likewise breaking allocations down by transport type. These are all good examples of labels with low cardinality.
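The low-cardinality design suggested here can be sketched with the same kind of series-map model (metric and label names are illustrative, not coturn's actual ones): labelling by transport type bounds the number of series by the number of transport values, no matter how many sessions exist.

```python
import random

# Exporter series map: (metric_name, sorted label set) -> value.
series = {}

def inc(metric, labels, amount=1):
    key = (metric, tuple(sorted(labels.items())))
    series[key] = series.get(key, 0) + amount

transports = ['udp', 'tcp', 'tls']
# 10,000 sessions still collapse into at most 3 time series.
for _ in range(10_000):
    inc('turn_allocations', {'transport': random.choice(transports)})

print(len(series))  # bounded by len(transports), i.e. at most 3
```

The series count is now fixed at design time, so memory use stays flat regardless of traffic.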
I want to echo @deyceg's comment. Per-user data is important for me to do some budget control, but for Prometheus metrics, overall system metrics like the current session count / current bandwidth are quite enough.
OK, it seems it's time to make decisions. @misi, are you OK with removing the allocation label? If yes, I'll try to prepare a PR. I cannot promise it in a short time because of the holidays, but it's easy to modify now.
I wrote a TURN monitor and it works pretty well in our production environment.
@wolmi I am OK with removing the allocation.
I did what I could to simplify it and include it in 4.5.2, but it is still a beta, not for production! I will keep this issue open as a reminder to add more metrics in the 4.5.3++ releases. Thanks to all for your contribution!
@misi I've just checked: you have already refactored https://github.com/coturn/coturn/blob/master/src/apps/relay/prom_server.c and removed the allocation. Do you need me to make any more changes?
Is someone planning to work on this before 4.5.3 is released? |
Hello @misi, @wolmi. In our project we use the REST API for authentication and Prometheus for metrics gathering. This leads to the same memory growth issue, because every new allocation actually creates a new user with the name "timestamp:userid". I would like to propose making per-user metrics optional. Also, I believe it would be useful to have a metric for the number of current allocations, even without any details, e.g. for draining an instance before it is terminated by an auto-scaling group. I can prepare PRs if you are interested.
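The label-free gauge proposed above could be sketched like this (the metric name `turn_current_allocations` and the callback hooks are hypothetical, not coturn's actual API): a single counter, incremented and decremented as allocations come and go, producing exactly one time series.

```python
# A single label-free gauge: one time series total, ideal for
# autoscaler draining decisions.
current_allocations = 0

def on_allocation_created():
    global current_allocations
    current_allocations += 1

def on_allocation_deleted():
    global current_allocations
    current_allocations -= 1

# Simulate five allocations opening and one closing.
for _ in range(5):
    on_allocation_created()
on_allocation_deleted()

# Exposition-format line the exporter would emit on scrape:
print(f'turn_current_allocations {current_allocations}')
```

Because the gauge carries no labels, its memory cost is constant no matter how many users or allocations the server handles.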
It seems that Prometheus keeps all relay connection records in memory, so CPU and memory usage keep increasing.
After rebooting the server, CPU/memory usage drops back down.
Is there any way to clear the memory without restarting the server?
It's OK to clear all the records.
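One mitigation, if per-allocation series must exist at all, is for the exporter to drop a series when its allocation closes instead of keeping records until restart. A minimal sketch of that idea (not coturn's actual behavior; names are illustrative):

```python
# Exporter series map: (metric_name, sorted label set) -> value.
series = {}

def inc(metric, labels, amount=1):
    key = (metric, tuple(sorted(labels.items())))
    series[key] = series.get(key, 0) + amount

def drop(metric, labels):
    """Forget a series once its allocation is closed."""
    series.pop((metric, tuple(sorted(labels.items()))), None)

for alloc_id in ('a1', 'a2', 'a3'):
    inc('turn_traffic_bytes', {'allocation_id': alloc_id}, 1500)

# Allocation a2 closes: its series no longer pins memory.
drop('turn_traffic_bytes', {'allocation_id': 'a2'})
print(len(series))  # 2
```

Note this only bounds memory by the number of *active* allocations; with high churn (e.g. trickle ICE), removing the label entirely, as decided above, remains the safer fix.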