write_prometheus plugin: New plugin for exposing metrics to Prometheus. #1967

Merged
merged 6 commits into collectd:master from octo:ff/write_prometheus on Nov 11, 2016

Conversation

octo
Member

octo commented Sep 29, 2016

This new plugin embeds an HTTP server that can be scraped from Prometheus, similar to the way collectd_exporter works. The metric family names and labels are created in the same manner as collectd_exporter so this plugin can function as a drop-in replacement.
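For anyone wanting to try it, here is a minimal configuration sketch (assuming the option names from the documentation reviewed further down, Port and StalenessDelta; the values are examples, not necessarily the defaults):

LoadPlugin write_prometheus
<Plugin write_prometheus>
  # TCP port for the embedded HTTP server (documented default: 9103).
  Port "9103"
  # Stop exporting a metric this many seconds after its last update.
  StalenessDelta 300
</Plugin>

Prometheus can then scrape http://<host>:9103/metrics, just as it would scrape collectd_exporter.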

@mfournier

This looks promising, thanks a lot for working on that!

I'll take a moment today to rebuild the jenkins workers with libmicrohttpd, and re-trigger the builds.

Count on me for testing this out in the near future :-)

@octo
Member Author

octo commented Sep 30, 2016

Thanks @mfournier, I was about to ask you how to get libmicrohttpd on the build minions ;)

@octo
Member Author

octo commented Sep 30, 2016

Let me point out some differences from collectd_exporter to look out for:

  • collectd_exporter adds the "_total" suffix only to COUNTER metrics, which are essentially unused in collectd. write_prometheus adds the "_total" suffix to both COUNTER and DERIVE metrics (see the sketch right after this list for the resulting name format). I intend to send a PR to collectd_exporter fixing this.
  • write_prometheus exports the time at which metrics were read. This should improve the precision of the irate() function and avoids clock skew. Due to a limitation in the protocol, this does not work for metrics with an interval longer than 5 minutes.
  • write_prometheus will stop exporting a metric when it is no longer updated. collectd_exporter will report that metric until it is restarted, if I understand correctly.
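To make the naming concrete, here is a standalone sketch of the resulting family-name pattern (inferred from the metric names appearing later in this thread, e.g. collectd_interface_if_octets_tx_total; illustrative only, not the plugin's actual code):

/* Sketch: map a collectd plugin/type/data-source triple to a Prometheus
 * metric family name. COUNTER and DERIVE data sources get the "_total"
 * suffix, GAUGE data sources do not. */
#include <stdio.h>

static void format_family_name(char *buf, size_t size, const char *plugin,
                               const char *type, const char *ds_name,
                               int is_counter_or_derive) {
  snprintf(buf, size, "collectd_%s_%s_%s%s", plugin, type, ds_name,
           is_counter_or_derive ? "_total" : "");
}

int main(void) {
  char name[256];
  format_family_name(name, sizeof(name), "interface", "if_octets", "tx", 1);
  puts(name); /* prints: collectd_interface_if_octets_tx_total */
  return 0;
}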

@octo
Member Author

octo commented Sep 30, 2016

There are some build issues on Precise, presumably due to an old version of libmicrohttpd:

write_prometheus.c: In function 'http_handler':
write_prometheus.c:213:10: error: implicit declaration of function 'MHD_create_response_from_buffer' [-Werror=implicit-function-declaration]
write_prometheus.c:214:32: error: 'MHD_RESPMEM_MUST_COPY' undeclared (first use in this function)
write_prometheus.c:214:32: note: each undeclared identifier is reported only once for each function it appears in
write_prometheus.c: In function 'prom_init':
write_prometheus.c:676:62: error: 'MHD_USE_DUAL_STACK' undeclared (first use in this function)
cc1: all warnings being treated as errors
make[3]: *** [write_prometheus_la-write_prometheus.lo] Error 1

@octo
Member Author

octo commented Oct 1, 2016

Wheezy is still having issues:

write_prometheus.c: In function 'prom_init':
write_prometheus.c:687:14: error: 'MHD_USE_DUAL_STACK' undeclared (first use in this function)
write_prometheus.c:687:14: note: each undeclared identifier is reported only once for each function it appears in

Wheezy has libmicrohttpd-dev 0.9.20-1+deb7u1.

@octo
Member Author

octo commented Oct 1, 2016

Great success!

@bs-github
Contributor

bs-github commented Oct 18, 2016

I'm quite keen to get this plugin, because collectd_exporter is an extra component that adds latency and can fail. Therefore, a big thanks for getting this in, @octo!
That said, I've tested the write_prometheus plugin and it seems to have bugs that make collectd crash. :-(

In my test setup (>25k lines in /metrics) I get lots of errors like this:

[2016-10-18 06:51:28] write_prometheus plugin: Deleting a metric in family "collectd_interface_if_octets_tx_total" failed with status 2
[2016-10-18 06:51:28] write_prometheus plugin: Deleting a metric in family "collectd_interface_if_packets_rx_total" failed with status 2
[2016-10-18 06:51:28] write_prometheus plugin: Deleting a metric in family "collectd_interface_if_errors_rx_total" failed with status 2
[2016-10-18 06:51:28] write_prometheus plugin: Deleting a metric in family "collectd_interface_if_errors_tx_total" failed with status 2

I naively assume this is the reason why the process grows over time (starting at single-digit GB and growing to the point where there is no more free memory) and eventually gets OOM-killed (kernel: [15057614.433287] Killed process 5234 (collectd) total-vm:121211840kB, anon-rss:58774940kB, file-rss:0kB).

There is a second thing that I think is unrelated, but I'm not sure where else to report it. Then again, it may be related after all, as I hadn't noticed this error before playing with this plugin.

[2016-10-19 05:24:35] plugin_value_list_clone: Unable to determine interval from context for value list "syd07/collectd-write_queue/queue_length". This indicates a broken plugin. Please report this problem to the collectd mailing list or at <http://collectd.org/bugs/>.
[2016-10-19 05:24:35] plugin_value_list_clone: Unable to determine interval from context for value list "syd07/collectd-write_queue/derive-dropped". This indicates a broken plugin. Please report this problem to the collectd mailing list or at <http://collectd.org/bugs/>.
[2016-10-19 05:24:35] plugin_value_list_clone: Unable to determine interval from context for value list "syd07/collectd-cache/cache_size". This indicates a broken plugin. Please report this problem to the collectd mailing list or at <http://collectd.org/bugs/>.

@octo
Member Author

octo commented Oct 18, 2016

Hi @bs-github, thanks for your feedback!

"Status 2" sounds like ENOENT – either the metric never made it to the write_prometheus plugin or the "missing" event was received more than once. Either way something we should look into.

I've run the plugin under Valgrind, but a clean result there of course only means that the shutdown handler works as intended. I'll see if I can track down a leak of live memory somewhere …

Best regards,
—octo

@bs-github
Contributor

@octo: what you describe as "the missing event was received more than once" is very likely what happens here. I have run into such cases in my setup.
Basically, what causes this is more than one collectd instance running on a box, all of them reporting to the same upstream collector.
This case is unusual, but it cannot be ruled out, and therefore I think collectd should be able to handle it.

@bs-github
Contributor

Another thing: I'm quite sure that collectd_exporter also stops exporting a metric when it is no longer updated. See here: https://github.com/prometheus/collectd_exporter/blob/master/main.go#L170-L179

@bs-github
Contributor

@octo: As you suggested, I added an "abort();" at the beginning of the shutdown callback. Then I recompiled with debugging enabled (./configure --enable-debug CFLAGS='-O0 -g' && make clean all install) and let it run under Valgrind, but that did not make the leak obvious to me ... (see the bottom of this post).

But the last line in the log catches my eye. Is that normal, a consequence of the abort() call?

[2016-10-21 22:39:54] Exiting normally.
[2016-10-21 22:39:54] collectd: Stopping 5 read threads.
[2016-10-21 22:39:54] plugin: stop_read_threads: Signalling `read_cond'
[2016-10-21 22:39:54] collectd: Stopping 5 write threads.
[2016-10-21 22:39:54] plugin: stop_write_threads: Signalling `write_cond'
[2016-10-21 22:41:00] plugin: 4354997 value lists were left after shutting down the write threads.

I've not yet done the test with the traffic generator, nor did I cut off the regular feed that I use for testing.

$ sudo valgrind --leak-check=full /opt/collectd/sbin/collectd -f -C /etc/collectd2/collectd.conf
==28970== Memcheck, a memory error detector
==28970== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al.
==28970== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info
==28970== Command: /opt/collectd/sbin/collectd -f -C /etc/collectd2/collectd.conf
==28970==
option = Hostname; value = syd07-collectd-prometheus;
option = FQDNLookup; value = true;
Done parsing `/usr/share/collectd/types.db'
Replacing DS `fork_rate' with another version.
Replacing DS `duration' with another version.
Done parsing `/etc/collectd/types.db'
Created new plugin context.
^C
==28970==
==28970== HEAP SUMMARY:
==28970==     in use at exit: 4,598,802 bytes in 55,478 blocks
==28970==   total heap usage: 52,571,322 allocs, 52,515,844 frees, 7,651,260,392 bytes allocated
==28970==
==28970== 8 bytes in 1 blocks are definitely lost in loss record 13 of 119
==28970==    at 0x4C2B6CD: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28970==    by 0x54E8CA1: strdup (strdup.c:43)
==28970==    by 0x4231D8: yyparse (parser.y:110)
==28970==    by 0x41FB2F: oconfig_parse_fh (oconfig.c:67)
==28970==    by 0x41FC32: oconfig_parse_file (oconfig.c:97)
==28970==    by 0x408F8D: cf_read_file (configfile.c:662)
==28970==    by 0x4096EC: cf_read_generic (configfile.c:851)
==28970==    by 0x409ED0: cf_read (configfile.c:1108)
==28970==    by 0x40746B: main (collectd.c:576)
==28970==
==28970== 240 bytes in 1 blocks are definitely lost in loss record 82 of 119
==28970==    at 0x4C2B7B2: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28970==    by 0x408BF4: cf_ci_append_children (configfile.c:555)
==28970==    by 0x40978B: cf_read_generic (configfile.c:866)
==28970==    by 0x408DF4: cf_include_all (configfile.c:610)
==28970==    by 0x408FCB: cf_read_file (configfile.c:669)
==28970==    by 0x4096EC: cf_read_generic (configfile.c:851)
==28970==    by 0x409ED0: cf_read (configfile.c:1108)
==28970==    by 0x40746B: main (collectd.c:576)
==28970==
==28970== 272 bytes in 1 blocks are possibly lost in loss record 84 of 119
==28970==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28970==    by 0x4012034: _dl_allocate_tls (dl-tls.c:297)
==28970==    by 0x4E3AABC: pthread_create@@GLIBC_2.2.5 (allocatestack.c:571)
==28970==    by 0x412EFA: plugin_thread_create (plugin.c:2808)
==28970==    by 0x602FB00: network_init (network.c:3477)
==28970==    by 0x410BD7: plugin_init_all (plugin.c:1750)
==28970==    by 0x406C4C: do_init (collectd.c:331)
==28970==    by 0x4078DE: main (collectd.c:728)
==28970==
==28970== 272 bytes in 1 blocks are possibly lost in loss record 85 of 119
==28970==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28970==    by 0x4012034: _dl_allocate_tls (dl-tls.c:297)
==28970==    by 0x4E3AABC: pthread_create@@GLIBC_2.2.5 (allocatestack.c:571)
==28970==    by 0x412EFA: plugin_thread_create (plugin.c:2808)
==28970==    by 0x602FB78: network_init (network.c:3497)
==28970==    by 0x410BD7: plugin_init_all (plugin.c:1750)
==28970==    by 0x406C4C: do_init (collectd.c:331)
==28970==    by 0x4078DE: main (collectd.c:728)
==28970==
==28970== 272 bytes in 1 blocks are possibly lost in loss record 86 of 119
==28970==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28970==    by 0x4012034: _dl_allocate_tls (dl-tls.c:297)
==28970==    by 0x4E3AABC: pthread_create@@GLIBC_2.2.5 (allocatestack.c:571)
==28970==    by 0x6AD093B: MHD_start_daemon_va (in /usr/lib/libmicrohttpd.so.5.2.1)
==28970==    by 0x6AD0ABE: MHD_start_daemon (in /usr/lib/libmicrohttpd.so.5.2.1)
==28970==    by 0x66BB764: prom_init (write_prometheus.c:716)
==28970==    by 0x410BD7: plugin_init_all (plugin.c:1750)
==28970==    by 0x406C4C: do_init (collectd.c:331)
==28970==    by 0x4078DE: main (collectd.c:728)
==28970==
==28970== 272 bytes in 1 blocks are possibly lost in loss record 87 of 119
==28970==    at 0x4C29DB4: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==28970==    by 0x4012034: _dl_allocate_tls (dl-tls.c:297)
==28970==    by 0x4E3AABC: pthread_create@@GLIBC_2.2.5 (allocatestack.c:571)
==28970==    by 0x40EF68: start_write_threads (plugin.c:882)
==28970==    by 0x410C65: plugin_init_all (plugin.c:1770)
==28970==    by 0x406C4C: do_init (collectd.c:331)
==28970==    by 0x4078DE: main (collectd.c:728)
==28970==
==28970== LEAK SUMMARY:
==28970==    definitely lost: 248 bytes in 2 blocks
==28970==    indirectly lost: 0 bytes in 0 blocks
==28970==      possibly lost: 1,088 bytes in 4 blocks
==28970==    still reachable: 4,597,466 bytes in 55,472 blocks
==28970==         suppressed: 0 bytes in 0 blocks
==28970== Reachable blocks (those to which a pointer was found) are not shown.
==28970== To see them, rerun with: --leak-check=full --show-reachable=yes
==28970==
==28970== For counts of detected and suppressed errors, rerun with: -v
==28970== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 2 from 2)

@mfournier

mfournier commented Nov 1, 2016

I get the following error when building against libmicrohttpd 0.9.51:

write_prometheus.c: In function ‘http_handler’:
write_prometheus.c:213:10: error: ‘MHD_create_response_from_data’ is deprecated: MHD_create_response_from_data() is deprecated, use MHD_create_response_from_buffer() [-Werror=deprecated-declarations]
   struct MHD_Response *res = MHD_create_response_from_data(
          ^~~~~~~~~~~~
In file included from write_prometheus.c:37:0:
/usr/include/microhttpd.h:2079:1: note: declared here
 MHD_create_response_from_data (size_t size,
 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1: all warnings being treated as errors
Makefile:4954: recipe for target 'write_prometheus_la-write_prometheus.lo' failed
make[3]: *** [write_prometheus_la-write_prometheus.lo] Error 1

MHD_create_response_from_buffer() got added in version 0.9.5 (January 2011). Here's a fix: mfournier@069233d

Edit: it seems GitHub now allows pushing to other people's branches, so I did just that against https://github.com/octo/collectd/tree/ff/write_prometheus.
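For reference, the general shape of such a compatibility guard (a sketch assuming an MHD_VERSION check; the actual change is in the commit linked above):

/* Sketch: fall back to the deprecated MHD_create_response_from_data() on
 * libmicrohttpd versions predating MHD_create_response_from_buffer()
 * (added in 0.9.5, i.e. MHD_VERSION 0x00090500). */
#include <microhttpd.h>

static struct MHD_Response *make_response(size_t size, char *buffer) {
#if MHD_VERSION >= 0x00090500
  return MHD_create_response_from_buffer(size, buffer, MHD_RESPMEM_MUST_COPY);
#else
  return MHD_create_response_from_data(size, buffer,
                                       /* must_free = */ MHD_NO,
                                       /* must_copy = */ MHD_YES);
#endif
}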

@mfournier

@bs-github can you try running collectd under valgrind's massif tool?

valgrind --tool=massif /opt/collectd/sbin/collectd -f (let this run for a few minutes/hours)
then ctrl-c and run the following command against the file created in the current directory:
ms_print massif.out.<pid> |less

This should show the function calls which allocate memory over time, and hopefully give a hint on what's going on.

Thanks !

@bs-github
Contributor

results from running under valgrind --tool=massif:

--------------------------------------------------------------------------------
Command:            /opt/collectd/sbin/collectd -f -C /etc/collectd2/collectd.conf
Massif arguments:   (none)
ms_print arguments: /tmp/massif.out.28631
--------------------------------------------------------------------------------


    GB
2.786^                                                                       @
     |                                                                    @@@#
     |                                                                @@::@@@#
     |                                                            :@@@@ : @@@#
     |                                                        @@:::@ @@ : @@@#
     |                                                    @@@:@@: :@ @@ : @@@#
     |                                                 :@@@@ :@@: :@ @@ : @@@#
     |                                             @@@@:@ @@ :@@: :@ @@ : @@@#
     |                                          @@@@ @ :@ @@ :@@: :@ @@ : @@@#
     |                                       @::@@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |                                  :::::@: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |                               ::::: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |                            ::::: :: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |                         @:::: :: :: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |                     @:::@: :: :: :: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |                  :::@:: @: :: :: :: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |              @:@@:::@:: @: :: :: :: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |          ::::@:@ :::@:: @: :: :: :: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |       :::::: @:@ :::@:: @: :: :: :: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
     |   :::::: ::: @:@ :::@:: @: :: :: :: : @: @@ @ @ :@ @@ :@@: :@ @@ : @@@#
   0 +----------------------------------------------------------------------->Ti
     0                                                                   1.813

Number of snapshots: 98
 Detailed snapshots: [9, 11, 15, 18, 27, 29, 30, 32, 33, 35, 36, 37, 39, 40, 43, 44, 45, 47, 48, 54, 64, 74, 84, 85 (peak), 95]

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1 57,517,458,822      116,721,856      103,655,902    13,065,954            0
  2 94,812,240,235      172,278,952      154,147,889    18,131,063            0
  3 148,532,893,308      285,329,408      256,304,993    29,024,415            0
  4 206,004,080,697      358,918,232      321,897,582    37,020,650            0
  5 240,751,149,092      404,740,584      363,426,620    41,313,964            0
  6 280,540,577,619      473,201,184      425,014,903    48,186,281            0
  7 309,051,099,837      498,930,184      448,313,926    50,616,258            0
  8 348,923,486,085      547,436,520      492,588,179    54,848,341            0
  9 393,808,008,078      624,467,800      563,117,653    61,350,147            0
90.18% (563,117,653B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->67.10% (419,010,560B) 0x40E923: plugin_value_list_clone (plugin.c:721)
| ->67.10% (419,010,560B) 0x40EC68: plugin_write_enqueue (plugin.c:782)
|   ->67.10% (419,010,560B) 0x411B77: plugin_dispatch_values (plugin.c:2287)
|     ->67.05% (418,690,280B) 0x6023CD9: network_dispatch_values (network.c:459)
|     | ->67.05% (418,690,280B) 0x6025D09: parse_packet (network.c:1491)
|     |   ->67.05% (418,690,280B) 0x6027EC1: dispatch_thread (network.c:2429)
|     |     ->67.05% (418,690,280B) 0x412E67: plugin_thread_start (plugin.c:2792)
|     |       ->67.05% (418,690,280B) 0x4E34E98: start_thread (pthread_create.c:308)
|     |
|     ->00.05% (320,280B) in 1+ places, all below ms_print's threshold (01.00%)
...

@octo
Member Author

octo commented Nov 2, 2016

@bs-github Could you please confirm that this is related to the write_prometheus plugin? The trace indicates that this is leaked in a location that is not changed by this PR.

If the memory grows without this plugin, too, could you please open a separate issue and attach the entire massif output file? I ran this plugin under massif last week and also found that lots of memory is allocated in plugin_value_list_clone(), so I think it's quite likely that this is related to some other change.

Thanks and best regards,
—octo

@octo
Member Author

octo commented Nov 7, 2016

@bs-github ping?

octo added this to the 5.7 milestone Nov 7, 2016
@rubenk
Contributor

rubenk commented Nov 8, 2016

FWIW, I noticed a leak in 5.6 too without this plugin.

@rubenk
Contributor

rubenk commented Nov 8, 2016

As for the PR itself, it looks good to me, nice work @octo!

@bs-github
Contributor

Sorry for the late reply, I've been busy with other things.
Besides that, I'm no longer in a position to test with a heavy real-world workload.

But to get back on topic: as far as I remember, I had an instance of collectd (same binary, separate config) running that only worked as a forwarder (network proxy setup). That instance had no memory issues (or only ones so minor that I didn't notice them).

This observation doesn't tell us that the issue is with this PR, of course. In fact, at this point, I doubt it is.

So, I'm good with moving this issue out of this PR.

@octo I still have the full massif.out around; can you give me some advice on where you want me to put it?

@octo
Member Author

octo commented Nov 11, 2016

Alright, I'll go ahead and merge this. @rubenk, @bs-github, could one of you please open a bug report against 5.6 regarding the memory leak? Birger, you can attach the massif output to the issue by dragging and dropping it into the browser window.

Thanks everybody! 🎉
—🐙

octo and others added 5 commits November 11, 2016 14:42
Profiling showed that prom_write() spent 73% of its time in this
function. 36% of time was spent in metric_create() and 19% was spent in
metric_destroy().

This patch replaces these two calls by a stack allocation, reducing the
time prom_write() spends in metric_family_get_metric() to 42%.

This function is a hotspot because it is used by bsearch() to look up
metrics in a metric family. This simple (though non-obvious) change
brings prom_write() complexity down from 3000 instructions/call to 2640
instructions/call, i.e. a 12% improvement.

Add switch on MHD_VERSION to support both legacy and modern MHD functions.

`MHD_create_response_from_data()` is deprecated since libmicrohttpd
0.9.5 and makes the build fail since 0.9.45.
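The first optimization above, in generic form (the types below are hypothetical, not the plugin's actual structures): build the bsearch() key on the stack instead of going through a create/destroy pair on every lookup.

#include <stdlib.h>
#include <string.h>

typedef struct {
  char name[64];
  double value;
} metric_t;

/* Comparison function used by bsearch(); compares metrics by name. */
static int metric_cmp(const void *a, const void *b) {
  return strcmp(((const metric_t *)a)->name, ((const metric_t *)b)->name);
}

/* Look up a metric in a sorted array. The key lives on the stack, so no
 * heap allocation (create) and no free (destroy) is needed per call. */
static metric_t *family_get_metric(metric_t *metrics, size_t num,
                                   const char *name) {
  metric_t key = {{0}, 0.0};
  strncpy(key.name, name, sizeof(key.name) - 1);
  return bsearch(&key, metrics, num, sizeof(*metrics), metric_cmp);
}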
@octo
Member Author

octo commented Nov 11, 2016

Rebased on a more recent master. Once the build bots are happy I'll merge.

…_DEFAULT_STALENESS_DELTA.

Fixes:

    write_prometheus.c:56:1: error: initializer element is not constant
     static cdtime_t staleness_delta = PROMETHEUS_DEFAULT_STALENESS_DELTA;
     ^
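For context, a generic illustration of this class of error (the names below are hypothetical, and this is not necessarily how the commit above fixes it): a static initializer in C must be a constant expression, so anything involving a function call has to be assigned at runtime, e.g. in the plugin's init callback.

#include <stdint.h>

typedef uint64_t cdtime_t;

/* Hypothetical stand-in for a conversion macro that is not a constant
 * expression because it expands to a function call. */
static cdtime_t seconds_to_cdtime(uint64_t s) { return s << 30; }
#define DEFAULT_STALENESS_DELTA seconds_to_cdtime(300)

/* static cdtime_t staleness_delta = DEFAULT_STALENESS_DELTA;
 *   --> error: initializer element is not constant */
static cdtime_t staleness_delta = 0;

int module_init(void) {
  if (staleness_delta == 0)
    staleness_delta = DEFAULT_STALENESS_DELTA; /* assign at runtime instead */
  return 0;
}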
octo merged commit 604607d into collectd:master Nov 11, 2016

@brian-brazil left a comment


Thanks for doing this. Here are some things I noticed in passing.


Port the embedded webserver should listen on. Defaults to B<9103>.

=item B<StalenessDelta> I<Seconds>


I think this setting is going to cause weirdness. If you're going to expose timestamps, always expose timestamps for a use case like this.

This is also what you will be doing once we fix staleness handling.

Member Author


This is the behavior that came out of the discussion in prometheus/collectd_exporter#24. Can you give me a reference to where this fix is being discussed or implemented? If that fix lands before collectd 5.7 is released I'm happy to remove this setting. Otherwise, we're going to add a deprecation notice when this setting is no longer needed and eventually change the default.


We only discussed GC, not this sort of behaviour. The issue is that what you've implemented here will break both now and once we finally fix staleness.

Consider that there's a metric being forwarded through a few systems, so that it's hovering around the 5m-old mark when it gets to Prometheus. Every time it goes over 5m, Prometheus will see it as "now". As Prometheus only allows samples to be appended in order, this means the next 5 minutes' worth of samples (which are under 5 minutes old) will be rejected and errors logged.

To avoid this you should always expose timestamps.

    ssnprintf(line, sizeof(line), "%s{%s} " GAUGE_FORMAT "%s\n", fam->name,
              format_labels(labels, sizeof(labels), m), m->gauge->value,
              timestamp_ms);
  else /* if (fam->type == IO__PROMETHEUS__CLIENT__METRIC_TYPE__COUNTER) */


This implementation doesn't handle Histograms or Summaries correctly. While collectd won't be exporting any, can you add a comment warning anyone looking to reuse this code elsewhere?

c_avl_iterator_destroy(iter);

char server[1024];
ssnprintf(server, sizeof(server), "\n# collectd/write_prometheus %s at %s\n",


You could use the approach at https://www.robustperception.io/exposing-the-software-version-to-prometheus/ to get the version into Prometheus


  for (size_t i = 0; i < m->n_label; i++)
    ssnprintf(labels[i], LABEL_BUFFER_SIZE, "%s=\"%s\"", m->label[i]->name,
              m->label[i]->value);


Are you sure the values won't need escaping?

Member Author


#2035
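(For anyone reusing this formatting code elsewhere: a minimal sketch of label-value escaping as the Prometheus text exposition format requires it, i.e. backslash, double quote and newline must be escaped. Illustrative only, not the code merged here.)

#include <stddef.h>

/* Escape a label value for the Prometheus text format:
 *   \  -> \\     "  -> \"     newline -> \n
 * Truncates if dst is too small; always NUL-terminates (for dst_size > 0). */
static void escape_label_value(char *dst, size_t dst_size, const char *src) {
  size_t j = 0;
  if (dst_size == 0)
    return;
  for (size_t i = 0; src[i] != '\0' && j + 2 < dst_size; i++) {
    switch (src[i]) {
    case '\\': dst[j++] = '\\'; dst[j++] = '\\'; break;
    case '"':  dst[j++] = '\\'; dst[j++] = '"';  break;
    case '\n': dst[j++] = '\\'; dst[j++] = 'n';  break;
    default:   dst[j++] = src[i];
    }
  }
  dst[j] = '\0';
}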

* 3 labels:
* [0] $plugin="$plugin_instance" => $plugin is the same within a family
* [1] type="$type_instance" => "type" is a static string
* [2] instance="$host" => "instance" is a static string


If we presume one collectd per machine with no forwarding, then there's no need to expose an instance label, as Prometheus already knows it. The user will need to remove the exported_instance label you're adding here (prefixed due to collision handling) in metric_relabel_configs.

If there is forwarding going on, then the docs should mention enabling honor_labels: true. Having Prometheus scrape all the collectds directly would be preferred.


As there already exist several ways to forward metrics from one collectd instance to another, we can expect some users to set up a central instance collecting values from a fleet of agents and point Prometheus at this single central collectd.

I agree with you that it's better to hit the "data source" directly, but I'm also aware that some people have legacy machines or exotic architectures on which they won't/can't install the latest collectd, nor any Prometheus exporter. These will be the typical users of forwarding, I guess.

@brian-brazil

A general question: how long do you expect it will take for this code to reach the majority of your users? Once this is fully deployed, we can deprecate the collectd_exporter.

@mfournier

@brian-brazil the write_prometheus plugin will be part of collectd 5.7.0, due roughly one month from now. I can't give a good answer about when it will be "fully deployed", but most Linux/BSD distros usually package new collectd releases within a couple of months.

I wouldn't deprecate the collectd_exporter too quickly though (cf. the folks stuck on a collectd version from several years ago).

@mfournier

Also, many thanks @octo! Having a proper bridge between these two awesome tools is really great!

@rubenk
Contributor

rubenk commented Nov 16, 2016

@octo I kept an eye on my collectd instance, and for the last 3 days the RSS stayed the same at 160 MB and the VSS at 1471 MB, so this might have been a false alarm. I do see my rrdcached process slowly increasing, so the leak might be in there.

bs-github mentioned this pull request Nov 18, 2016