
Running on docker with prometheus open keep eating up memory #666

Closed

yuanchieh-cheng opened this issue Dec 16, 2020 · 20 comments

@yuanchieh-cheng

It seems that Prometheus keeps all relay connection records in memory, so CPU and memory usage keep increasing.
After rebooting the server, CPU / memory usage drops back down.

Is there any way to clear the memory without restarting the server?
It's OK to clear all the records.

@yuanchieh-cheng yuanchieh-cheng changed the title from "Running on docker with prometheus keep eating up memory" to "Running on docker with prometheus open keep eating up memory" on Dec 16, 2020
@domeger

domeger commented Dec 16, 2020

@yuanchieh-cheng how did you enable the endpoint? Via the config?

@yuanchieh-cheng
Author

yuanchieh-cheng commented Dec 16, 2020

Via the config.

Can I switch it via an endpoint instead? I didn't see that in the documentation.
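For reference, enabling it via the config looks roughly like the snippet below. This reflects my understanding of the option on the master branch at the time; the exporter's default port may differ between builds, so verify against your version.

```
# turnserver.conf (sketch)
# Enable the Prometheus metrics exporter; it serves /metrics on its own
# listener (port 9641 by default, as far as I know).
prometheus
```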

@misi
Contributor

misi commented Dec 16, 2020

@yuanchieh-cheng Can you please find the root cause, fix it, and send a pull request?
I'm working on other issues and won't find time to look into it in the short term...
Many thanks in advance for your help.

@misi misi added the bug label Dec 16, 2020
@misi misi added this to the next build milestone Dec 16, 2020
@yuanchieh-cheng
Author

yuanchieh-cheng commented Dec 16, 2020

I am not a C programmer, but I will try my best.
My proposal is to flush all the data whenever the /metrics API is called successfully; otherwise it would keep collecting data as it does now.
Is it OK to do so?
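A minimal, self-contained sketch of that idea (this is not coturn's actual prom_server.c; the table, names, and handler below are made up purely to illustrate "render, then reset on success"):

```c
#include <stdio.h>
#include <string.h>

/* Illustrative per-allocation counter table; in the real server this would be
 * whatever the exporter currently accumulates per relay connection. */
#define MAX_ALLOCATIONS 1024

struct alloc_metric {
    char id[64];                 /* allocation identifier used as a label */
    unsigned long sent_bytes;
    unsigned long recv_bytes;
};

static struct alloc_metric table[MAX_ALLOCATIONS];
static size_t table_len;

/* Write the current counters in Prometheus exposition format. */
static int render_metrics(FILE *out)
{
    if (fprintf(out, "# TYPE turn_allocation_traffic_bytes counter\n") < 0)
        return -1;
    for (size_t i = 0; i < table_len; i++) {
        if (fprintf(out,
                    "turn_allocation_traffic_bytes{allocation=\"%s\",direction=\"sent\"} %lu\n"
                    "turn_allocation_traffic_bytes{allocation=\"%s\",direction=\"received\"} %lu\n",
                    table[i].id, table[i].sent_bytes,
                    table[i].id, table[i].recv_bytes) < 0)
            return -1;
    }
    return 0;
}

/* The proposed behaviour: drop the accumulated records only after the
 * /metrics response was written successfully, so nothing is lost on a failed
 * scrape, but the table cannot grow without bound between scrapes. */
static void handle_metrics_scrape(FILE *out)
{
    if (render_metrics(out) == 0) {
        memset(table, 0, sizeof(table));
        table_len = 0;
    }
}

int main(void)
{
    /* Simulate two allocations recorded between scrapes. */
    table[table_len++] = (struct alloc_metric){ "alloc-1", 10240, 8192 };
    table[table_len++] = (struct alloc_metric){ "alloc-2", 512, 480 };

    handle_metrics_scrape(stdout);                     /* prints both, then flushes */
    printf("records after scrape: %zu\n", table_len);  /* 0 */
    return 0;
}
```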

@misi
Contributor

misi commented Dec 16, 2020

@wolmi Can you please help with fixing this issue?
See: #517

@misi
Contributor

misi commented Dec 16, 2020

@yuanchieh-cheng I don't have a Prometheus setup, so I don't yet fully see what data is stuck in memory.
Can you dig deeper into a memory dump and give more info about which data blocks are stuck in memory and eating it up?

@yuanchieh-cheng
Author

OK, I will try to produce the debug data.

@wolmi
Contributor

wolmi commented Dec 16, 2020

I think it is related to the huge number of allocations and the fact that the allocation is used as a label, which causes each metric to carry a lot of information.

Is the use case a test environment or production? Can you share some info about the number of users?

@yuanchieh-cheng
Author

yuanchieh-cheng commented Dec 17, 2020

Yes, I think that is the root cause of the issue.
It is under testing, but I think the issue is environment-independent.
I run a cron job at a 1-minute rate that starts a TCP relay as a health check.
Roughly one day later (86400 records generated), it had eaten up roughly 200 MB.
Even worse, I found the container threw a segmentation fault (core dump) after a day, with only the health check cron job and no other users.
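For a rough sense of scale, 200 MB across 86400 records works out to about 2.4 KB per recorded allocation, which is plausible if each one keeps its own label strings and counter values resident in the exporter.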

@deyceg

deyceg commented Dec 23, 2020

I think it is related to the huge number of allocations and the fact that the allocation is used as a label, which causes each metric to carry a lot of information.

Is the use case a test environment or production? Can you share some info about the number of users?

@wolmi We came to the same conclusion. Allocation ID is a very bad, high-cardinality label. It should be removed, otherwise the exporter is not usable.

More information can be found here
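To make the cardinality problem concrete, a scrape with a per-allocation label looks roughly like this (the metric and label names here are illustrative, not coturn's exact ones); every new allocation adds brand-new time series that Prometheus has to index and keep:

```
# TYPE turn_traffic_bytes counter
turn_traffic_bytes{allocation="1608112345:abc123",direction="sent"} 10240
turn_traffic_bytes{allocation="1608112345:abc123",direction="received"} 8192
turn_traffic_bytes{allocation="1608112407:def456",direction="sent"} 512
turn_traffic_bytes{allocation="1608112407:def456",direction="received"} 480
# ...two more series for every allocation that has ever existed
```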

@wolmi
Contributor

wolmi commented Dec 23, 2020

I ran into a similar issue with my first exporter, where I created a basic process that subscribes to the Redis stats and generates the Prometheus /metrics endpoint. I finally had to remove the allocation from the labels: first to fix queries that were making Prometheus crash, but also because it was useless, since you cannot track a user's connections using only the allocation, especially when you have multiple instances and users are automatically balanced between them.

When I decided to add the exporter directly into the coturn service, my main goal was to have exactly the same metrics and output as the Redis one, to keep it somewhat compatible. But now, if the allocations are a problem, I can propose two different approaches:

  • remove it completely from the metrics.
  • make it optional, disabled by default, adding a new parameter to allow having the allocation label if someone needs it.

@deyceg

deyceg commented Dec 23, 2020

I ran into a similar issue with my first exporter, where I created a basic process that subscribes to the Redis stats and generates the Prometheus /metrics endpoint. I finally had to remove the allocation from the labels: first to fix queries that were making Prometheus crash, but also because it was useless, since you cannot track a user's connections using only the allocation, especially when you have multiple instances and users are automatically balanced between them.

When I decided to add the exporter directly into the coturn service, my main goal was to have exactly the same metrics and output as the Redis one, to keep it somewhat compatible. But now, if the allocations are a problem, I can propose two different approaches:

* remove it completely from the metrics.

* make it optional, disabled by default, adding a new parameter to allow having the allocation label if someone needs it.

My preference is to simply remove it. If you want to do analysis per user/session, then storing the allocations in redis/mysql/mongodb/some other storage backend is a better tool for the job. Let's not conflate logging and metrics.

There is no solution here that doesn't have a massive impact on Prometheus. It basically killed our monitoring stack with just a handful of users because of trickle ICE (a single WebRTC call for a single user created 15 separate sessions!). Now multiply this by a few hundred conference rooms and a few thousand users :).

[attached screenshot: metrics]

I'd be more interested in seeing connection statistics and their state in Prometheus, as well as bandwidth usage per server rather than per session, and likewise allocations broken down by transport type. These are all good examples of low-cardinality labels.
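A sketch of what those low-cardinality equivalents could look like on a scrape (again, the names are only illustrative):

```
# TYPE turn_allocations gauge
turn_allocations{transport="udp"} 312
turn_allocations{transport="tcp"} 45
# TYPE turn_server_traffic_bytes counter
turn_server_traffic_bytes{direction="sent"} 90812345
turn_server_traffic_bytes{direction="received"} 88123901
```

The label values here (transport, direction) come from small, fixed sets, so the number of series stays constant no matter how many allocations come and go.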

@sj82516

sj82516 commented Dec 23, 2020

I want to echo @deyceg's comment.

Per-user data is important for me to do some budget control.
Storing user session data in a database is a good idea because I could better estimate the storage size and manage storage expansion.

For Prometheus metrics, overall system metrics like the current session count / current bandwidth are quite enough.

@wolmi
Contributor

wolmi commented Dec 23, 2020

OK, it seems it's time to make a decision.

@misi are you OK with removing the allocation label?

If yes, I'll try to prepare a PR. I cannot promise it in a short time because of the holidays, but it's easy to modify now.

@sj82516

sj82516 commented Jan 6, 2021

I wrote a TURN monitor and it works pretty well in our production environment.
If anyone needs statistics right now, you could consider my solution: https://github.com/sj82516/turn-monitor.
Any suggestions are welcome.

@misi misi modified the milestones: next build, 4.5.2 Jan 7, 2021
@misi
Contributor

misi commented Jan 9, 2021

@wolmi I am OK with removing the allocation.

@misi
Contributor

misi commented Jan 10, 2021

I did what I could to simplify it and include it in 4.5.2, but it is still a beta, not for production!

I will keep this issue open as a reminder to add more metrics in the 4.5.3++ releases.
@wolmi if you have time, please send a PR about it.

Thanks to all for your contributions!

@misi misi modified the milestones: 4.5.2, next build, 4.5.3 Jan 10, 2021
@wolmi
Contributor

wolmi commented Feb 23, 2021

@misi I've just checked: you have already refactored https://github.com/coturn/coturn/blob/master/src/apps/relay/prom_server.c and removed the allocation.

Do you need me to make any more changes?

@nazar-pc

Is someone planning to work on this before 4.5.3 is released?
This was expected to happen (#517 (comment)) and usable Prometheus support would be a huge benefit for production use.

@dsmeytis
Contributor

Hello @misi, @wolmi. In our project we use the REST API for authentication and Prometheus for metrics gathering. This leads to the same memory-growth issue, because every new allocation actually creates a new user with the name "timestamp:userid". I would like to propose making per-user metrics optional. Also, I believe it may be useful to have a metric of the current number of allocations even without any details, e.g. for draining an instance before it is terminated by an auto-scaling group. I can prepare PRs if you are interested.
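The draining idea only needs a single low-cardinality gauge per instance, something like the following (the metric name is hypothetical, not an existing coturn metric); an autoscaling hook could then wait until it reports 0 before terminating the instance:

```
# HELP turn_current_allocations Number of allocations currently active on this instance.
# TYPE turn_current_allocations gauge
turn_current_allocations 42
```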
