
Maybe there exists an unknown memory leak #2033

Closed
dllhlx opened this issue Nov 11, 2016 · 9 comments

Comments


dllhlx commented Nov 11, 2016

  • Version of collectd: collectd 5.4
  • Operating system / distribution: CentOS 6

Expected behavior

collectd's ps_rss stays stable while collectd is running

Actual behavior

Several machines' processes-collectd.ps_rss has gradually increased over the last 12 hours; the Graphite graph is below:
[graph: collectd-debug]

From the graph we can see that the memory leak, although small, cannot be ignored.

Steps to reproduce

collectd's configuration is:

FQDNLookup   false

TypesDB "/opt/collectd/share/collectd/types.db"

Interval 30
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin interface
LoadPlugin load
LoadPlugin memory
LoadPlugin swap
LoadPlugin tcpconns
LoadPlugin uptime
LoadPlugin users

LoadPlugin processes
<Plugin processes>
    ProcessMatch "collectd" "collectd"
</Plugin>

LoadPlugin statsd
<Plugin statsd>
    Host "127.0.0.1"
    Port "12430"
    TimerPercentile 90.0
</Plugin>

LoadPlugin logfile
<Plugin logfile>
    LogLevel info
    File "/var/log/collectd.log"
    Timestamp true
    PrintSeverity false
</Plugin>

LoadPlugin match_regex
<Chain PostCache>
  <Rule "ignore_disk_metrics">
    <Match "regex">
      Plugin "^disk"
      PluginInstance "^dm.*"
    </Match>
    Target "stop"
  </Rule>

  <Rule "ignore_tcpcon_metrics">
    <Match "regex">
      Plugin "^tcpconns"
      PluginInstance "^[0-9]+-local$"
    </Match>
    Target "stop"
  </Rule>

  <Rule "ignore_ipmi_metrics">
    <Match "regex">
      Plugin "^ipmi"
      Type "^fanspeed.*"
    </Match>
    Target "stop"
  </Rule>

  # Default target
  Target "write"
</Chain>

LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "test">
    Host "xx.xxx.xxx"
    Port "2016"
    Protocol "udp"
    Prefix "h."
  </Node>
</Plugin>


dllhlx commented Nov 17, 2016

@octo after I changed the config as below:

FQDNLookup   false

TypesDB "/opt/collectd/share/collectd/types.db"

Interval 30
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk

LoadPlugin processes
<Plugin processes>
    ProcessMatch "collectd" "collectd"
</Plugin>

LoadPlugin statsd
<Plugin statsd>
    Host "127.0.0.1"
    Port "12430"
    TimerPercentile 90.0
</Plugin>

LoadPlugin logfile
<Plugin logfile>
    LogLevel info
    File "/var/log/collectd.log"
    Timestamp true
    PrintSeverity false
</Plugin>

LoadPlugin match_regex
<Chain PostCache>
  <Rule "ignore_disk_metrics">
    <Match "regex">
      Plugin "^disk"
      PluginInstance "^dm.*"
    </Match>
    Target "stop"
  </Rule>

  <Rule "ignore_tcpcon_metrics">
    <Match "regex">
      Plugin "^tcpconns"
      PluginInstance "^[0-9]+-local$"
    </Match>
    Target "stop"
  </Rule>

  <Rule "ignore_ipmi_metrics">
    <Match "regex">
      Plugin "^ipmi"
      Type "^fanspeed.*"
    </Match>
    Target "stop"
  </Rule>

  # Default target
  Target "write"
</Chain>

LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "test">
    Host "xx.xxx.xxx"
    Port "2016"
    Protocol "udp"
    Prefix "h."
  </Node>
</Plugin>

After I restarted collectd, collectd.ps_rss has stayed stable the whole time.

On the other hand, I added DEBUG() calls after the malloc/calloc/realloc calls, and I found that no malloc/calloc/realloc is invoked while collectd.ps_rss increases. Very strange.
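
A minimal sketch of the kind of allocation tracing I mean (purely illustrative wrapper functions and macros, not collectd's actual code) would look like this:

#include <stdio.h>
#include <stdlib.h>

/* Illustrative debug wrappers: log the requested size and resulting pointer
 * of every allocation that goes through them. In the actual test, a DEBUG()
 * call was added right after the existing malloc/calloc/realloc call sites. */
static void *debug_malloc(size_t size, const char *where) {
  void *p = malloc(size);
  fprintf(stderr, "DEBUG: malloc(%zu) in %s -> %p\n", size, where, p);
  return p;
}

static void *debug_calloc(size_t n, size_t size, const char *where) {
  void *p = calloc(n, size);
  fprintf(stderr, "DEBUG: calloc(%zu, %zu) in %s -> %p\n", n, size, where, p);
  return p;
}

static void *debug_realloc(void *old, size_t size, const char *where) {
  void *p = realloc(old, size);
  fprintf(stderr, "DEBUG: realloc(%p, %zu) in %s -> %p\n", old, size, where, p);
  return p;
}

/* Redirect subsequent allocation calls in this translation unit to the
 * logging wrappers. */
#define malloc(sz)       debug_malloc((sz), __func__)
#define calloc(n, sz)    debug_calloc((n), (sz), __func__)
#define realloc(ptr, sz) debug_realloc((ptr), (sz), __func__)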


octo commented Nov 18, 2016

Thanks for the update @dllhlx! Reproducing this issue has proven difficult and this way we can focus on the difference in the config.


dllhlx commented Nov 21, 2016

@octo here is a graph that shows the memory change clearly:
[graph: leak1]
We can see that collectd's RSS jumps from 3.5 MB to 13 MB.
Another machine's graph shows ps-rss increasing in the same way:
[graph: leak2]
and the ps-vm also jumped:
[graph: leak3]


dllhlx commented Nov 21, 2016

Update:
Besides adding print statements after the malloc/calloc/realloc calls, I also print the size/length of all the structures (read_heap, cache_tree, list_write, write_queue_t) in each running cycle, and I find that these structures' sizes stay stable even though ps_rss keeps increasing.
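
Roughly, the per-cycle dump looked like the following sketch (the counters here are stand-ins; in the real test the values were read from collectd's actual read_heap, cache_tree, list_write and write queue structures):

#include <stdio.h>
#include <stddef.h>

/* Stand-in counters for the sizes of collectd's internal structures;
 * in the real patch these values came from the live structs. */
static size_t read_heap_size;
static size_t cache_tree_size;
static size_t list_write_size;
static size_t write_queue_length;

/* Called once per read cycle to log the structure sizes next to ps_rss. */
static void dump_struct_sizes(void) {
  fprintf(stderr,
          "DEBUG: read_heap=%zu cache_tree=%zu list_write=%zu write_queue=%zu\n",
          read_heap_size, cache_tree_size, list_write_size, write_queue_length);
}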


dllhlx commented Nov 21, 2016

Another leak was reported by valgrind after running valgrind --tool=memcheck --leak-check=full /home/q/collectd/sbin/collectd -f and stopping it with Ctrl+C:

==22085== 
==22085== HEAP SUMMARY:
==22085==     in use at exit: 260,513 bytes in 1,473 blocks
==22085==   total heap usage: 4,667 allocs, 3,194 frees, 6,911,937 bytes allocated
==22085== 
==22085== 8 bytes in 1 blocks are definitely lost in loss record 30 of 252
==22085==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==22085==    by 0x3679680E91: strdup (in /lib64/libc-2.12.so)
==22085==    by 0x418799: yyparse (parser.y:101)
==22085==    by 0x415D71: oconfig_parse_fh (oconfig.c:60)
==22085==    by 0x415E4C: oconfig_parse_file (oconfig.c:90)
==22085==    by 0x409666: cf_read_generic (configfile.c:643)
==22085==    by 0x40A2DC: cf_read (configfile.c:1086)
==22085==    by 0x405E98: main (collectd.c:470)
==22085== 
==22085== 48 bytes in 1 blocks are definitely lost in loss record 110 of 252
==22085==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==22085==    by 0x4092B0: cf_read_generic (configfile.c:807)
==22085==    by 0x40A2DC: cf_read (configfile.c:1086)
==22085==    by 0x405E98: main (collectd.c:470)
==22085== 
==22085== 1,200 bytes in 1 blocks are definitely lost in loss record 230 of 252
==22085==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==22085==    by 0x4A06BA2: realloc (vg_replace_malloc.c:662)
==22085==    by 0x4091F5: cf_ci_append_children (configfile.c:531)
==22085==    by 0x409705: cf_read_generic (configfile.c:854)
==22085==    by 0x409899: cf_read_generic (configfile.c:591)
==22085==    by 0x40A2DC: cf_read (configfile.c:1086)
==22085==    by 0x405E98: main (collectd.c:470)
==22085== 
==22085== LEAK SUMMARY:
==22085==    definitely lost: 1,256 bytes in 3 blocks
==22085==    indirectly lost: 0 bytes in 0 blocks
==22085==      possibly lost: 0 bytes in 0 blocks
==22085==    still reachable: 259,257 bytes in 1,470 blocks
==22085==         suppressed: 0 bytes in 0 blocks
==22085== Reachable blocks (those to which a pointer was found) are not shown.
==22085== To see them, rerun with: --leak-check=full --show-reachable=yes
==22085== 
==22085== For counts of detected and suppressed errors, rerun with: -v
==22085== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 6 from 6)

As described above, there are three leaks, but I am not sure whether the "definitely lost" entries in this report are real or not.


rubenk commented Dec 8, 2016

@dllhlx any chance you can reproduce this on a recent version of collectd? We have fixed a lot of leaks since 5.4. If you need binary packages for CentOS 6, we have them at https://github.com/collectd/collectd-ci


rubenk commented Feb 19, 2017

@dllhlx ping?


dllhlx commented Feb 27, 2017

@rubenk sorry for the late feedback. I will test collectd-5.7 some time later and will report back any changes.


rubenk commented May 5, 2017

Closing stale issue.

rubenk closed this as completed on May 5, 2017