Expose a health check endpoint? #69

Currently we are facing an issue where the collector process stops working after running for some time (a few days), and the /metrics response contains "ruby_collector_working 0".

All of our workloads run in Kubernetes, and we run the collector process in a sidecar container. If the collector web server (the WEBrick process) exposed a health check endpoint like '/healthz', we could use it as the container livenessProbe, so Kubernetes could restart the container whenever the collector stops working.
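For reference, a minimal sketch of the kind of endpoint being requested, assuming direct access to a WEBrick server like the one the exporter runs internally; the port, route, and standalone server here are illustrative, not the gem's actual API:

```ruby
require "webrick"

# Hypothetical liveness route; 9394 is just an example port, and this
# standalone server stands in for the exporter's internal WEBrick instance.
server = WEBrick::HTTPServer.new(Port: 9394)
server.mount_proc "/healthz" do |_req, res|
  res.status = 200
  res.body = "OK"
end
trap("INT") { server.shutdown }
server.start
```

With a route like this exposed, a standard Kubernetes livenessProbe pointing at /healthz could restart the sidecar automatically.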
Comments
This is a bit odd; we have not seen collectors die across our infrastructure at all. That metric is there to inform the central server that the collector has no metrics. My guess is that this would happen if you restarted the collector container and somehow your old containers were no longer able to talk to it. I think the remediation here is actually to restart the entire pod. A quick question first: do you have IPv6 stability across container destroy/create? Are you communicating via v4? If so, do you have port stability?
Hi @SamSaffron, thanks for your reply. We are using IPv4 only in our infrastructure. We have subclassed …

What we are observing is that the memory usage of the prometheus_exporter can grow to over 1GB, which is really strange. Also, your reply implies that you are running the prometheus_exporter as a central server? We are running it as a sidecar of each puma server (we have over 90 pods), which I guess is not recommended. And I do find that since some metrics are of type …

One thing we are afraid of in running the prometheus_exporter as a central server is that it is a single point of failure, and we found that if the prometheus_exporter is down, our puma server gets flooded with error logs like: …
1GB in a collector is a big concern; that would mean some list is ever-growing. Summary metrics do have a ceiling on them, but I guess if you keep adding metrics, problems can arise. Or if you have a summary with 10,000 different label permutations, things can get strange. Can you look at one of your collectors at, say, 512MB and see how many metrics you have reported? Then maybe do a Ruby heap dump via rbtrace to see where the leak is. You can run as many collectors as you need, but certainly there is a memory impact; that said, it should be small in the big scheme of things. This is one of our collectors on a very busy server that has been running since March 27 (RSS is merely 36872).

Let's first work out what your memory issue is prior to attempting workarounds here.
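A sketch of what such an rbtrace-driven heap dump can look like; the script name and output path are illustrative:

```ruby
# heap_dump.rb -- eval this inside the running collector, e.g. via:
#   rbtrace -p <collector_pid> -e 'load "heap_dump.rb"'
# For per-generation and allocation-site data, allocation tracing
# (ObjectSpace.trace_object_allocations_start) must already be enabled.
require "objspace"

GC.start
File.open("/tmp/collector-heap.json", "w") do |io|
  ObjectSpace.dump_all(output: io) # one JSON object per heap slot
end
```

The resulting dump can then be sliced by GC generation, for example with the heapy gem mentioned later in this thread (`heapy read /tmp/collector-heap.json`).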
@duanshiqiang we also noticed a memory leak in the exporter; do you have any updates?

Are you using the latest version, or maybe some previous one?
Whatever this is, I need more data. Can you run through the steps here:
https://samsaffron.com/archive/2015/03/31/debugging-memory-leaks-in-ruby
@SamSaffron thanks for your article. I started the exporter and launched an ab script against both the metrics endpoint and some Rails app endpoints, repeating everything from your guide. I picked generation 3315 for detailed analysis. Can you please suggest what to do next?
3315 is not a good one to pick ... look at `generation 1250 objects 22`. This looks like a steady object leak of sorts; what are the objects? Note the RSS there is 71504, which is not enormous but could be better.
@SamSaffron, they all point to prometheus_exporter-0df4dcd79d65/lib/prometheus_exporter.rb:15 (* 22)
That is an indicator that you are collecting a giant pile of metrics. What metrics do you have? Do you have a script that can repro the issue?
@SamSaffron see https://gist.github.com/Fivell/224a9cda0fadf1ff323a730c5a0d36dc. As I said, I just call several endpoints thousands of times using ab.
@SamSaffron I also reproduced this on a staging env. The metrics output was filed as a separate issue (#77), and the heap data analysis also points to the same location. As you can see, ruby_collector_rss is 571297792 in the metrics, so I guess it really is leaking; I think @duanshiqiang will have the same results.
After trying json instead of oj I get the same results. But @SamSaffron, I found your heapy gem; very helpful!
Fundamentally, all the data enters through a single entry point. So what you can do here to create a 100% reproduction of the issue is log the "string" and "timestamp" to a file. Then, if we have the file, we can replay it: mock Time.now and Process.clock_gettime(Process::CLOCK_MONOTONIC) and reproduce the issue 100% consistently.

Note that the heapy dump is not showing anything terribly bad; yes, there are bits of duplication, but only 20MB or so are in Ruby heaps. Also, I wonder whether we are somehow getting VSZ on your server vs RSS?
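A sketch of what that capture step could look like, assuming a prepend-based shim over the collector's `process` entry point; the module name and log path are made up:

```ruby
require "json"
require "prometheus_exporter/server"

# Hypothetical recorder: append every raw payload plus a monotonic
# timestamp to a file so the whole session can be replayed later.
module PayloadRecorder
  LOG = File.open("/tmp/collector-payloads.jsonl", "a")
  LOG.sync = true

  def process(str)
    LOG.puts({
      monotonic: Process.clock_gettime(Process::CLOCK_MONOTONIC),
      payload: str
    }.to_json)
    super
  end
end

PrometheusExporter::Server::Collector.prepend(PayloadRecorder)
```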
@SamSaffron sample attached: prometheus-exrpoter-debug.log (https://github.com/discourse/prometheus_exporter/files/3355770/prometheus-exrpoter-debug.log) [image: memory usage graph]
That is not nearly enough data IMO... Ruby boots at like 30MB and can get to 60MB very easily with almost no load.

At Discourse we run Ruby with a tighter memory footprint by doing:
https://github.com/discourse/discourse-prometheus/blob/f5f791f0312c3a25fca67172c991505bdce44f13/lib/demon.rb#L18-L19
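The gist of the linked lines is spawning the process with a constrained Ruby GC; the variables and values below illustrate the technique and are not a quote of that file:

```ruby
# Illustrative only: start the collector with conservative GC settings
# so the heap cannot balloon silently. Values are examples.
env = {
  "RUBY_GC_HEAP_GROWTH_MAX_SLOTS" => "40000",
  "RUBY_GC_HEAP_INIT_SLOTS"       => "400000"
}
pid = Process.spawn(env, "bundle", "exec", "prometheus_exporter")
Process.detach(pid)
```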
What I would like to see here is a graph of it reaching the same 200MB with the attached logs.

In the meantime, it may be worth working on a standalone repro.
```ruby
# replay the captured payload log against a fresh collector;
# `process` and `mock_time` are placeholder helpers
collector = PrometheusExporter::Server::Collector.new
process(log) do |str, time|
  mock_time(time)         # stub Time.now / CLOCK_MONOTONIC to the recorded time
  collector.process(str)  # feed the raw payload back in
end
```
If we can get that, then to repro the leak we can debug it very cleanly with the memory_profiler gem, because it will basically take seconds to reach the same memory bloat you are seeing take days.
@SamSaffron I'm not sure time somehow affects it. I very soon end up with this picture: [image: memory usage graph]
@SamSaffron a sample of this, but without time mocking (https://github.com/Fivell/exp_memory_test), shows the following after 25m of file processing:

Total allocated: 267388634 bytes (2373642 objects)
Total retained: 83211730 bytes (864334 objects)

allocated memory by gem
-----------------------------------
220305928 json
47082706 other

allocated memory by file
-----------------------------------
220305928 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb
47082706 test.rb

allocated memory by location
-----------------------------------
217064088 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb:156
43770282 test.rb:12
3312344 test.rb:11
3241840 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb:155
80 test.rb:13

allocated memory by class
-----------------------------------
111253154 String
94661728 JSON::Ext::Parser
58161328 Hash
3303920 Array
8424 File
80 Proc

allocated objects by gem
-----------------------------------
1959164 json
414478 other

allocated objects by file
-----------------------------------
1959164 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb
414478 test.rb

allocated objects by location
-----------------------------------
1878118 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb:156
331878 test.rb:12
82599 test.rb:11
81046 /Users/igorfedoronchuk/.rvm/rubies/ruby-2.5.5/lib/ruby/2.5.0/json/common.rb:155
1 test.rb:13

allocated objects by class
-----------------------------------
1964376 String
245620 Hash
82598 Array
81046 JSON::Ext::Parser
1 File
1 Proc

retained memory by gem
-----------------------------------
83211730 json

any suggestions?
Oh my ... this Puma collector is a GIANT problem.

Compare to this:
https://github.com/discourse/prometheus_exporter/blob/32a6bd1f6c670096b4cc230b242a1f0cb8dcfb89/lib/prometheus_exporter/server/process_collector.rb#L64-L67

Can you see how it follows a clean pattern to clear out history? The Puma collector does none of this:
https://github.com/discourse/prometheus_exporter/blob/32a6bd1f6c670096b4cc230b242a1f0cb8dcfb89/lib/prometheus_exporter/server/puma_collector.rb#L49-L51

I accepted this as a community contribution, as I am not a Puma user in my own projects, but sadly this is an enormous issue. https://github.com/discourse/prometheus_exporter/pull/34/files needs to be redone to follow the same practice we use in the process collector. The key here is that instead of growing an array forever in the collector, this needs some sort of max age defined, so there is no situation where the array can grow without bound.

Can you try a PR to fix this? Just copy the pattern used in the process collector for max age, and see if you can figure out a clean way of deduping (not critical, because this thing is 100% gauges). Note that you will have to mock time to see the fix actually working.

Thank you so much for persisting here! You went above and beyond. I am sadly on holiday at the moment (for another 3 weeks), so I can not really work on this, but I will merge and push a gem once you test the fix.
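For reference, a minimal sketch of the max-age pattern being described, modeled on the process collector; the constant name, field name, and 30-second value are illustrative:

```ruby
# Stamp each observation and expire stale ones instead of letting
# @puma_metrics grow forever.
MAX_PUMA_METRIC_AGE = 30 # seconds

def collect(obj)
  now = ::Process.clock_gettime(::Process::CLOCK_MONOTONIC)
  obj["created_at"] = now
  @puma_metrics.delete_if do |current|
    current["created_at"] + MAX_PUMA_METRIC_AGE < now
  end
  @puma_metrics << obj
end
```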
@eviltrout I can not fix this... but it is a very big mess that impacts every consumer of this gem that uses Puma. (It does not impact Discourse, but it is very not good.) Can you get someone to fix this and push a new gem out there? I just gave you push access on the gem. The fix is pretty straightforward: follow the same pattern the process collector uses so the array does not grow forever.
@SamSaffron started testing on staging with this: didww@809b382

@SamSaffron I'll handle it, no worries.

@eviltrout 👍 testing the attached PR today; everything seems good now.
I reviewed @Fivell's PR and also noticed the unicorn collector has the same problem. It's been fixed in master and I'm releasing a new version right now.
@eviltrout thanks, will finally test tomorrow; we have all of them, unicorns and pumas. One thing I can't figure out: if Discourse runs on unicorn and the unicorn collector has the same problem, how can it be that, according to @SamSaffron, it does not impact Discourse?
@Fivell I reached out to operations when I found the unicorn problem, because I had the same suspicion. It turns out the unicorn monitoring in this gem is not used by the discourse-prometheus plugin, so that one did not affect us either.

@eviltrout thanks for the clarifications! Good luck.