Possible memory leak (or misconfiguration) #2929
When I configured the different queues, I assumed that a queue would not consume any memory when not in use. It seems I was wrong: queues allocate memory when needed, but the allocated space is never released. My use case involves devices that can send short, very high bursts of logs, which is why I set the various buffers to very high values; but since the space is never recovered, that is not a viable option. Test 1: … Test 3: … |
@jbfuzier : thanks for the report, we need some time to investigate the issue and check your config |
Thanks. I have seen in some older posts that jemalloc can improve memory management a lot, and I am in the process of trying this approach (someone pointed out in previous issues that malloc sometimes does not hand memory back to the kernel, which looks a lot like what I am seeing). I posted a question about jemalloc on the mailing list, as I am having issues using it dynamically (LD_PRELOAD is disabled for ELF binaries with capabilities or setuid :( ). On an unrelated topic, is it you guys who are providing support through oneidentity? I am considering asking my boss to purchase commercial support; do you have the name of a good company that could offer that? |
Well... 29GB of memory is 29GB of memory... I think replacing the memory allocator won't solve your issue. Of course we can give it a try, but I think the root cause is something else. Yes, we are those guys :) I can connect you with the right person if you need more information on commercial support. |
Thanks, I have a meeting with someone from oneidentity this afternoon. |
I managed to get jemalloc running and ran some profiling with the following config:

```
export MALLOC_CONF="prof:true,prof_prefix:jeprof.out,stats_print:true,lg_prof_interval:30"
```

```
/logs/lvapp/bin/jeprof --show_bytes /logs/lvapp/sbin/syslog-ng jeprof110.out.1414419.1362.i1362.heap
Using local file /logs/lvapp/sbin/syslog-ng.
Using local file jeprof110.out.1414419.1362.i1362.heap.
Welcome to jeprof! For help, type 'help'.
(jeprof) top
Total: 5892422190 B
5829225430  98.9%  98.9%  5829225430  98.9%  g_malloc0
  41146818   0.7%  99.6%    41146818   0.7%  je_prof_backtrace
  19947914   0.3% 100.0%    19947914   0.3%  g_realloc
   1573632   0.0% 100.0%     1573632   0.0%  g_slice_set_config
    528394   0.0% 100.0%      528394   0.0%  tls_get_addr_tail
         0   0.0% 100.0%  5879691955  99.8%  __clone
         0   0.0% 100.0%    47059268   0.8%  __iv_event_run_pending_events
         0   0.0% 100.0%    12730234   0.2%  __libc_start_main
         0   0.0% 100.0%      528394   0.0%  __tls_get_addr
         0   0.0% 100.0%     1048832   0.0%  _counter_group_logpipe_init
```

```
/logs/lvapp/bin/jeprof --show_bytes --dot /logs/lvapp/sbin/syslog-ng jeprof110.out.1414419.1362.i1362.heap
```

[graphviz (1)](https://user-images.githubusercontent.com/4688432/65501615-94eb1700-dec1-11e9-98d5-a332947d285e.png)

Also output from pmap & /proc/x/smaps:
[syslognd_smaps.txt](https://github.com/syslog-ng/syslog-ng/files/3646547/syslognd_smaps.txt)
[syslog_ng_pmap.txt](https://github.com/syslog-ng/syslog-ng/files/3646548/syslog_ng_pmap.txt)

However, I am not sure how to interpret these results, or even whether they give any useful information? |
Hmm, maybe the scratch buffers garbage collection is not running in the kafka thread?
|
We have a comment around g_string_steal: "we passed the allocated buffers to rdkafka, which will eventually free them"... Is it possible that rdkafka won't free up the memory? What is the exact version of librdkafka? |
Another question: is syslog-ng publishing to kafka successfully? Because if not, then librdkafka will keep a lot of messages in flight, a lot more than syslog-ng buffers in its queue. That would also explain the memory growth. By the way, you can control how many messages librdkafka buffers with its configuration settings, described here: https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md
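(For illustration, the relevant knobs in syslog-ng's kafka-c destination could look something like this; the property names are librdkafka's, the broker and topic are placeholders, and the exact option syntax may differ by syslog-ng version:)

```
destination d_kafka {
  kafka-c(
    config(
      "metadata.broker.list" => "localhost:9092"   # placeholder broker
      # librdkafka's in-flight buffering limits (see CONFIGURATION.md):
      "queue.buffering.max.messages" => "100000"
      "queue.buffering.max.kbytes" => "1048576"
    )
    topic("syslog")                                # placeholder topic
  );
};
```
|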
Kafka output is consistent with syslog-ng input.
librdkafka is at v1.1.0. My buffers are configured this way: …
Which, in the worst case with 4k messages, would result in 4GB (yet memory usage keeps growing; I reached 29GB).
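(The actual buffer config is collapsed above; a hypothetical sizing that would match the stated worst case is along these lines, with both option values assumed for illustration:)

```
options {
  # Hypothetical values: worst-case queue memory is roughly
  # log-fifo-size x message size, e.g. 1,000,000 x 4 KiB = ~4 GiB.
  log-fifo-size(1000000);
  log-msg-size(4096);
};
```
|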
According to the librdkafka contract, the flag RD_KAFKA_MSG_F_FREE asks rdkafka to release the memory UNLESS rd_kafka_produce returns -1: https://github.com/syslog-ng/syslog-ng/blob/master/modules/kafka/kafka-dest-worker.c#L64
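(A minimal C sketch of that contract, assuming librdkafka v1.x; this is not syslog-ng's exact code:)

```c
#include <librdkafka/rdkafka.h>

/* With RD_KAFKA_MSG_F_FREE, librdkafka calls free(3) on `payload` once it is
 * done with it -- UNLESS rd_kafka_produce() returns -1, in which case
 * ownership stays with the caller. */
static int
produce_one(rd_kafka_topic_t *rkt, char *payload, size_t len)
{
  if (rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_FREE,
                       payload, len,
                       NULL, 0,   /* no key */
                       NULL) == -1)
    {
      /* Error path: the buffer is still ours; keeping it for a retry is
       * fine, but allocating a fresh one on top of it leaks the old one. */
      return -1;
    }
  /* Success: librdkafka owns `payload` now; we must not touch or free it. */
  return 0;
}
```
|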
There's a potential fix for this in #2942. I didn't have time to test it, so don't put it in production. The issue was on the syslog-ng side. |
You are right: indeed, if rd_kafka_produce() returns with an error, we would keep the allocated memory in self->message and self->key to be used at the next publish. The issue was the condition on when to steal the key() data. By the way, there's even a workaround: specify a key() option, possibly set to the empty string (that should be the same as passing NULL to rd_kafka_produce); in that case, your leak should stop.
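(A sketch of the workaround; broker and topic values are placeholders, adjust to your own config:)

```
destination d_kafka {
  kafka-c(
    config("metadata.broker.list" => "localhost:9092")  # placeholder
    topic("syslog")                                     # placeholder
    # Workaround: an explicit empty key() should be equivalent to passing
    # NULL to rd_kafka_produce(), bypassing the leaking code path.
    key("")
  );
};
```
|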
@bazsi Thanks for the fix and the explanation :) |
Setting the key seems to indeed mitigate the issue. |
I managed to reproduce the issue. This way, none of the messages were successfully forwarded to kafka. I have seen a memory usage of … So I think there might be other issues as well; still investigating. |
I'm reopening the issue, because #2942 might not be enough to stop leaking the memory (see #2929 (comment)). |
Beware of malloc overheads: the minimum allocated size is 16 bytes, and mind fragmentation. I still think this can actually close the entire leak, so it would be great if @jbfuzier could confirm.
|
Thanks @bazsi for the heads-up! I ran two tests, one with the fix included and one without the fix: …
I guess the fix is enough, but I will wait for @jbfuzier to confirm this, too. :) |
I checked, and without the fix, using key("$R_MSEC") it still leaks:

```
==13806== 36 bytes in 9 blocks are definitely lost in loss record 30 of 88
==13806==    at 0x4C2FD5F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13806==    by 0x516B877: g_realloc (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4800.2)
==13806==    by 0x5186D06: g_string_maybe_expand (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4800.2)
==13806==    by 0x5186F4E: g_string_set_size (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4800.2)
==13806==    by 0x4E90D3F: format_padded_int32 (str-format.c:236)
==13806==    by 0x4EB1F83: log_macro_expand_date_time (macros.c:367)
==13806==    by 0x4EB1F83: log_macro_expand (macros.c:635)
==13806==    by 0x4EB0FC9: log_template_append_format_with_context (templates.c:221)
==13806==    by 0x4EB1246: log_template_append_format (templates.c:273)
==13806==    by 0xD171C6B: _format_message_and_key (kafka-dest-worker.c:49)
==13806==    by 0xD171C6B: kafka_dest_worker_insert (kafka-dest-worker.c:144)
==13806==    by 0x4ECA2A0: log_threaded_dest_worker_insert (logthrdestdrv.h:192)
==13806==    by 0x4ECA2A0: _perform_inserts (logthrdestdrv.c:399)
==13806==    by 0x4ECA2A0: _perform_work (logthrdestdrv.c:511)
==13806==    by 0x4ECBB9E: iv_run_tasks (iv_task.c:48)
==13806==    by 0x4ECF54B: iv_main (iv_main_posix.c:99)
...
==13806== 2,048 bytes in 1 blocks are definitely lost in loss record 78 of 88
==13806==    at 0x4C2DB8F: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13806==    by 0x4C2FDEF: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13806==    by 0x516B877: g_realloc (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4800.2)
==13806==    by 0x5186D06: g_string_maybe_expand (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4800.2)
==13806==    by 0x5186D51: g_string_sized_new (in /lib/x86_64-linux-gnu/libglib-2.0.so.0.4800.2)
==13806==    by 0xD171F05: kafka_dest_worker_new (kafka-dest-worker.c:175)
==13806==    by 0x4ECADBE: _construct_worker (logthrdestdrv.c:887)
==13806==    by 0x4ECADBE: _start_worker_thread (logthrdestdrv.c:907)
==13806==    by 0x4ECADBE: log_threaded_dest_driver_start_workers (logthrdestdrv.c:1048)
==13806==    by 0x4E77BBD: log_pipe_init (logpipe.h:291)
==13806==    by 0x4E77BBD: cfg_tree_start (cfg-tree.c:1390)
==13806==    by 0x4E71DF2: cfg_init (cfg.c:347)
==13806==    by 0x4E88EB7: main_loop_initialize_state (mainloop.c:203)
==13806==    by 0x4E894EF: main_loop_read_and_init_config (mainloop.c:611)
==13806==    by 0x401B24: main (main.c:285)
```

With the fix, and no key("$R_MSEC"), it does not leak. Although, with the fix and with key("$R_MSEC"), it leaks again. I will check it. |
librdkafka doesn't seem to free the memory for the key.
|
Based on the docs, only the payload is freed by the library:
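(A hypothetical C helper illustrating that asymmetry; not the actual syslog-ng code. RD_KAFKA_MSG_F_FREE transfers ownership of the payload only, while the key bytes are copied internally by librdkafka:)

```c
#include <glib.h>
#include <librdkafka/rdkafka.h>

static int
produce_with_key(rd_kafka_topic_t *rkt,
                 char *payload, size_t payload_len,
                 gchar *key, size_t key_len)
{
  int rc = rd_kafka_produce(rkt, RD_KAFKA_PARTITION_UA, RD_KAFKA_MSG_F_FREE,
                            payload, payload_len,
                            key, key_len,  /* copied, never freed by librdkafka */
                            NULL);
  if (rc == 0)
    g_free(key);  /* still ours; forgetting this leaks one key per message */
  /* On failure, both buffers stay with the caller for a retry, as in the
   * earlier sketch. */
  return rc;
}
```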
|
Opened a PR to fix it. |
syslog-ng
Version of syslog-ng
3.23
Platform
rhel6.10
Issue
Failure
Hi,
I switched from v3.9 to v3.23.
The daemon is handling a large amount of UDP logs (~10K EPS) from many sources.
Everything is written to:
1/ A single big file
2/ One file per equipment sending logs
3/ Kafka output
The config file is quite complex, and it is hard to reproduce the issue in the lab without the real traffic, which makes this hard to debug. As it is in production, it is also very hard to do trial and error with the config.
I used pmap -x on two occasions to track the change in the allocated memory; between the runs there were several new 131072KB anon regions allocated. I dumped one of those regions with gdb; it contains lots of log data from different equipment, but I have no idea how to track which part of the code or which component did this allocation.
RAM usage after 1 day is at 29GB.
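(For reference, a sketch of the inspection steps described above; the region addresses and file names are placeholders taken from the pmap output:)

```
# Snapshot the mappings; repeat later and diff to spot growing anon regions
pmap -x $(pidof syslog-ng) > pmap.$(date +%s).txt

# Dump one large anonymous region for inspection (START/END from pmap)
gdb -p $(pidof syslog-ng) \
    -ex 'dump binary memory region.bin 0xSTART 0xEND' \
    -ex detach -ex quit
```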
Configuration