Expose a health check endpoint? #69

Currently we are facing an issue where the collector process stops working after running for some time (a few days), and the /metrics response contains "ruby_collector_working 0".

All of our workloads run in Kubernetes, and we run the collector process in a sidecar container. If the collector web server (the WEBrick process) exposed a health check endpoint like '/healthz', we could use it as the container livenessProbe, so Kubernetes could restart the container whenever the collector stops working.
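For reference, a minimal sketch of the kind of endpoint being requested, assuming direct access to a WEBrick server like the one the exporter runs internally; the port, route, and standalone server here are illustrative, not the gem's actual API:

```ruby
require "webrick"

# Hypothetical liveness route; 9394 is just an example port, and this
# standalone server stands in for the exporter's internal WEBrick instance.
server = WEBrick::HTTPServer.new(Port: 9394)
server.mount_proc "/healthz" do |_req, res|
  res.status = 200
  res.body = "OK"
end
trap("INT") { server.shutdown }
server.start
```

With a route like this exposed, a standard Kubernetes livenessProbe pointing at /healthz could restart the sidecar automatically.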
Comments
This is a bit odd; we have not seen collectors die across our infrastructure at all. That metric is there to inform the central server that the collector has no metrics. My guess is that this would happen if you restarted the collector container and somehow your old containers were no longer able to talk to it. I think the remediation here is actually to restart the entire pod. A quick question first: do you have IPv6 stability across container destroy/create? Are you communicating via v4? If so, do you have port stability?
Hi @SamSaffron, thanks for your reply. We are using IPv4 only in our infrastructure. We have subclassed …

What we are observing is that the memory usage of the prometheus_exporter can grow to over 1GB, which is really strange. Also, your reply implies that you are running the prometheus_exporter as a central server? We are running it as a sidecar of each puma server (we have over 90 pods), which I guess is not recommended. And I do find that since some metrics are of type …

One thing we are afraid of in running the prometheus_exporter as a central server is that it is a single point of failure, and we found that if the prometheus_exporter is down, our puma server gets flooded with error logs like: …
1GB in a collector is a big concern; that would mean some list is ever-growing. Summary metrics do have a ceiling on them, but I guess if you keep adding metrics, problems can arise. Or if you have a summary with 10,000 different label permutations, things can get strange. Can you look at one of your collectors at, say, 512MB and see how many metrics you have reported? Then maybe do a Ruby heap dump via rbtrace to see where the leak is. You can run as many collectors as you need, but certainly there is a memory impact; that said, it should be small in the big scheme of things. This is one of our collectors on a very busy server that has been running since March 27 (RSS is merely 36872).

Let's first work out what your memory issue is prior to attempting workarounds here.
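A sketch of what such an rbtrace-driven heap dump can look like; the script name and output path are illustrative:

```ruby
# heap_dump.rb -- eval this inside the running collector, e.g. via:
#   rbtrace -p <collector_pid> -e 'load "heap_dump.rb"'
# For per-generation and allocation-site data, allocation tracing
# (ObjectSpace.trace_object_allocations_start) must already be enabled.
require "objspace"

GC.start
File.open("/tmp/collector-heap.json", "w") do |io|
  ObjectSpace.dump_all(output: io) # one JSON object per heap slot
end
```

The resulting dump can then be sliced by GC generation, for example with the heapy gem mentioned later in this thread (`heapy read /tmp/collector-heap.json`).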
@duanshiqiang we also noticed a memory leak in the exporter; do you have any updates?

Are you using the latest version, or maybe some previous one?
Whatever this is, I need more data. Can you run through the steps here:
https://samsaffron.com/archive/2015/03/31/debugging-memory-leaks-in-ruby
@SamSaffron thanks for your article. I started the exporter and launched an ab script against both the metrics endpoint and some Rails app endpoints, repeating everything from your guide. I picked generation 3315 for detailed analysis. Can you please suggest what to do next?
3315 is not a good one to pick ... look at `generation 1250 objects 22`. This looks like a steady object leak of sorts; what are the objects? Note the RSS there is 71504, which is not enormous but could be better.
@SamSaffron, they all point to prometheus_exporter-0df4dcd79d65/lib/prometheus_exporter.rb:15 (* 22)
That is an indicator that you are collecting a giant pile of metrics. What metrics do you have? Do you have a script that can repro the issue?
@SamSaffron see https://gist.github.com/Fivell/224a9cda0fadf1ff323a730c5a0d36dc. As I said, I just call several endpoints thousands of times using ab.
@SamSaffron I also reproduced this on a staging env. The metrics output was filed as a separate issue (#77), and the heap data analysis also points to the same location. As you can see, ruby_collector_rss is 571297792 in the metrics, so I guess it really is leaking; I think @duanshiqiang will have the same results.
After trying json instead of oj I get the same results. But @SamSaffron, I found your heapy gem; very helpful!
Fundamentally, all the data enters through a single entry point. So what you can do here to create a 100% reproduction of the issue is log the "string" and "timestamp" to a file. Then, if we have the file, we can replay it: mock Time.now and Process.clock_gettime(Process::CLOCK_MONOTONIC) and reproduce the issue 100% consistently.

Note that the heapy dump is not showing anything terribly bad; yes, there are bits of duplication, but only 20MB or so are in Ruby heaps. Also, I wonder whether we are somehow getting VSZ on your server vs RSS?
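A sketch of what that capture step could look like, assuming a prepend-based shim over the collector's `process` entry point; the module name and log path are made up:

```ruby
require "json"
require "prometheus_exporter/server"

# Hypothetical recorder: append every raw payload plus a monotonic
# timestamp to a file so the whole session can be replayed later.
module PayloadRecorder
  LOG = File.open("/tmp/collector-payloads.jsonl", "a")
  LOG.sync = true

  def process(str)
    LOG.puts({
      monotonic: Process.clock_gettime(Process::CLOCK_MONOTONIC),
      payload: str
    }.to_json)
    super
  end
end

PrometheusExporter::Server::Collector.prepend(PayloadRecorder)
```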
@SamSaffron sample attached: prometheus-exrpoter-debug.log (https://github.com/discourse/prometheus_exporter/files/3355770/prometheus-exrpoter-debug.log) [image: memory usage graph]
That is not nearly enough data IMO... Ruby boots at like 30MB and can get to 60MB very easily with almost no load.

At Discourse we run Ruby with a tighter memory footprint by doing:
https://github.com/discourse/discourse-prometheus/blob/f5f791f0312c3a25fca67172c991505bdce44f13/lib/demon.rb#L18-L19
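The gist of the linked lines is spawning the process with a constrained Ruby GC; the variables and values below illustrate the technique and are not a quote of that file:

```ruby
# Illustrative only: start the collector with conservative GC settings
# so the heap cannot balloon silently. Values are examples.
env = {
  "RUBY_GC_HEAP_GROWTH_MAX_SLOTS" => "40000",
  "RUBY_GC_HEAP_INIT_SLOTS"       => "400000"
}
pid = Process.spawn(env, "bundle", "exec", "prometheus_exporter")
Process.detach(pid)
```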
What I would like to see here is a graph of it reaching the same 200MB with the attached logs.

In the meantime, it may be worth working on a standalone repro.
```ruby
# replay the captured payload log against a fresh collector;
# `process` and `mock_time` are placeholder helpers
collector = PrometheusExporter::Server::Collector.new
process(log) do |str, time|
  mock_time(time)         # stub Time.now / CLOCK_MONOTONIC to the recorded time
  collector.process(str)  # feed the raw payload back in
end
```
If we can get that, then to repro the leak we can debug it very cleanly with the memory_profiler gem, because it will basically take seconds to reach the same memory bloat you are seeing take days.
@SamSaffron I'm not sure time somehow affects it. I very soon end up with this picture: [image: memory usage graph]
@SamSaffron a sample of this, but without time mocking (https://github.com/Fivell/exp_memory_test), shows the following after 25m of file processing:

Total allocated: 267388634 bytes (2373642 objects)
Total retained: 83211730 bytes (864334 objects)

allocated memory by gem
-----------------------------------
220305928 json
47082706 other

allocated memory by file
-----------------------------------
220305928 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb
47082706 test.rb

allocated memory by location
-----------------------------------
217064088 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb:156
43770282 test.rb:12
3312344 test.rb:11
3241840 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb:155
80 test.rb:13

allocated memory by class
-----------------------------------
111253154 String
94661728 JSON::Ext::Parser
58161328 Hash
3303920 Array
8424 File
80 Proc

allocated objects by gem
-----------------------------------
1959164 json
414478 other

allocated objects by file
-----------------------------------
1959164 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb
414478 test.rb

allocated objects by location
-----------------------------------
1878118 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb:156
331878 test.rb:12
82599 test.rb:11
81046 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb:155
1 test.rb:13

allocated objects by class
-----------------------------------
1964376 String
245620 Hash
82598 Array
81046 JSON::Ext::Parser
1 File
1 Proc

retained memory by gem
-----------------------------------
83211730 json

any suggestions?
Oh my ... this Puma collector is a GIANT problem.

Compare to this:
https://github.com/discourse/prometheus_exporter/blob/32a6bd1f6c670096b4cc230b242a1f0cb8dcfb89/lib/prometheus_exporter/server/process_collector.rb#L64-L67

Can you see how it follows a clean pattern to clear out history? The Puma collector does none of this:
https://github.com/discourse/prometheus_exporter/blob/32a6bd1f6c670096b4cc230b242a1f0cb8dcfb89/lib/prometheus_exporter/server/puma_collector.rb#L49-L51

I accepted this as a community contribution, as I am not a Puma user in my own projects, but sadly this is an enormous issue. https://github.com/discourse/prometheus_exporter/pull/34/files needs to be redone to follow the same practice we use in the process collector. The key here is that instead of growing an array forever in the collector, this needs some sort of max age defined, so there is no situation where the array can grow without bound.

Can you try a PR to fix this? Just copy the pattern used in the process collector for max age, and see if you can figure out a clean way of deduping (not critical, because this thing is 100% gauges). Note that you will have to mock time to see the fix actually working.

Thank you so much for persisting here! You went above and beyond. I am sadly on holiday at the moment (for another 3 weeks), so I can not really work on this, but I will merge and push a gem once you test the fix.
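For reference, a minimal sketch of the max-age pattern being described, modeled on the process collector; the constant name, field name, and 30-second value are illustrative:

```ruby
# Stamp each observation and expire stale ones instead of letting
# @puma_metrics grow forever.
MAX_PUMA_METRIC_AGE = 30 # seconds

def collect(obj)
  now = ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
  obj["created_at"] = now
  @puma_metrics.delete_if do |current|
    current["created_at"] + MAX_PUMA_METRIC_AGE < now
  end
  @puma_metrics << obj
end
```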
@eviltrout I can not fix this... but it is a very big mess that impacts every consumer of this gem that uses Puma. (It does not impact Discourse, but it is very not good.) Can you get someone to fix this and push a new gem out there? I just gave you push access on the gem. The fix is pretty straightforward: follow the same pattern the process collector uses so the array does not grow forever.
@SamSaffron started testing on staging with this: didww@809b382

@SamSaffron I'll handle it, no worries.

@eviltrout 👍 testing the attached PR today; everything seems good now.
I reviewed @Fivell's PR and also noticed the unicorn collector has the same problem. It's been fixed in master and I'm releasing a new version right now.
@eviltrout thanks, will finally test tomorrow; we have all of them, unicorns and pumas. One thing I can't figure out: if Discourse runs on unicorn and the unicorn collector has the same problem, how can it be that, according to @SamSaffron, it does not impact Discourse?
@Fivell I reached out to operations when I found the unicorn problem, because I had the same suspicion. It turns out the unicorn monitoring in this gem is not used by the discourse-prometheus plugin, so that one did not affect us either.

@eviltrout thanks for the clarifications! Good luck.