Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stemcells>3312] syslogd on all VMs has memory leak #1537

Closed
voelzmo opened this Issue Dec 14, 2016 · 19 comments

Comments

Projects
None yet
@voelzmo
Copy link
Member

commented Dec 14, 2016

We're observing syslogd using an excessive amount of memory on all stemcells >3312 (i.e. 3312.x series). This seems to be related to the version of rsyslogd, which got updated from 8.22.0-0adiscon1trusty1 in 3312 to 8.23.0-0adiscon1trusty1 in 3312.3-3312.8.

This prevents us from using most recent 3312.x stemcells which contain important security fixes.

dpkg_l.txt for 3312 stemcell has:

 ii rsyslog                            8.22.0-0adiscon1trusty1           amd64        a rocket-fast system for log processing
ii  rsyslog-gnutls                     8.22.0-0adiscon1trusty1           amd64        TLS protocol support for rsyslog
ii  rsyslog-mmjsonparse                8.22.0-0adiscon1trusty1           amd64        Parsing/handling of CEE/Lumberjack JSON messages in rsyslog
ii  rsyslog-relp                       8.22.0-0adiscon1trusty1

e.g. 3312.8 has:

ii rsyslog                            8.23.0-0adiscon1trusty1           amd64        a rocket-fast system for log processing
ii  rsyslog-gnutls                     8.23.0-0adiscon1trusty1           amd64        TLS protocol support for rsyslog
ii  rsyslog-mmjsonparse                8.23.0-0adiscon1trusty1           amd64        Parsing/handling of CEE/Lumberjack JSON messages in rsyslog
ii  rsyslog-relp                       8.23.0-0adiscon1trusty1 

This update probably happened because the base_os step to install packages on Ubuntu uses a PPA to install rsyslog which just contains 'latest v8 version' of rsyslog.

Here some numbers from random VMs showing that rsyslogd consumes way more memory in version 8.23.0 than it did in version 8.22.0:

version 8.23.0:

# ps -p 17469 -fv
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
17469 ?        Ssl   22:16     86   551 6596752 3961352 51.7 /usr/sbin/rsyslogd

# rsyslogd -version                                                                                                                                             
rsyslogd 8.23.0, compiled with:
        PLATFORM:                               x86_64-pc-linux-gnu
        PLATFORM (lsb_release -d):
        FEATURE_REGEXP:                         Yes
        GSSAPI Kerberos 5 support:              No
        FEATURE_DEBUG (debug build, slow code): No
        32bit Atomic operations supported:      Yes
        64bit Atomic operations supported:      Yes
        memory allocator:                       system default
        Runtime Instrumentation (slow code):    No
        uuid support:                           Yes
        Number of Bits in RainerScript integers: 64

Version 8.22.0:

# ps -p 2088 -fv
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
 2088 ?        Ssl    0:06      1   557 271950 4696  0.1 /usr/sbin/rsyslogd

# rsyslogd -version
rsyslogd 8.22.0, compiled with:
        PLATFORM:                               x86_64-pc-linux-gnu
        PLATFORM (lsb_release -d):
        FEATURE_REGEXP:                         Yes
        GSSAPI Kerberos 5 support:              No
        FEATURE_DEBUG (debug build, slow code): No
        32bit Atomic operations supported:      Yes
        64bit Atomic operations supported:      Yes
        memory allocator:                       system default
        Runtime Instrumentation (slow code):    No
        uuid support:                           Yes
        Number of Bits in RainerScript integers: 64

Interestingly enough, most CF components continue to work, there are however other services that don't cope well if co-located with a memory hog.

@davidelang

This comment has been minimized.

Copy link

commented Dec 14, 2016

does the memory usage continue to increase over time? or is it stable, but just much higher than before?

what's the rsyslog.conf file?

@stephanme

This comment has been minimized.

Copy link

commented Dec 15, 2016

Our 00-forwarder.conf contains:

$ModLoad imuxsock                       # local message reception (rsyslog uses a datagram socket)
$MaxMessageSize 10240 # default is 2k
$WorkDirectory /var/vcap/sys/rsyslog/buffered # where messages should be buffered on disk

# Forward vcap messages to the aggregator
#
$ActionResumeRetryCount -1              # Try until the server becomes available
$ActionQueueType LinkedList             # Allocate on-demand
$ActionQueueFileName agg_backlog        # Spill to disk if queue is full
$ActionQueueMaxDiskSpace 32m            # Max size for disk queue
$ActionQueueLowWaterMark 2000           # Num messages. Assuming avg size of 512B, this is 1MiB.
$ActionQueueHighWaterMark 8000          # Num messages. Assuming avg size of 512B, this is 4MiB. (If this is reached, messages will spill to disk until the low watermark is reached).
$ActionQueueTimeoutEnqueue 0            # Discard messages if the queue + disk is full
$ActionQueueSaveOnShutdown on           # Save in-memory data to disk if rsyslog shuts down

# Listen for logs over UDP
#
$ModLoad imudp
$UDPServerAddress 127.0.0.1cd
$UDPServerRun 514




template(name="CfLogTemplate" type="list") {
        constant(value="<")
        property(name="pri")
        constant(value=">")
        property(name="timestamp" dateFormat="rfc3339")
        
          constant(value=" 10.0.65.66 ")
        
        property(name="programname")
        constant(value=" [job=dummy_z1 index=0] ")
        property(name="msg")
}






*.* @@10.0.68.3:5514;CfLogTemplate






# Log vcap messages locally, too
#$template VcapComponentLogFile, "/var/log/%programname:6:$%/%programname:6:$%.log"
#$template VcapComponentLogFormat, "%timegenerated% %syslogseverity-text% -- %msg%\n"
#:programname, startswith, "vcap." -?VcapComponentLogFile;VcapComponentLogFormat

# Prevent them from reaching anywhere else
:programname, startswith, "vcap." ~

which is basically what metron agent configures. I.e. we send logs via TCP.

We executed the following script on 2 VMs with stemcell 3312.7 and 3312:

while :
do
  logger $RANDOM'TESTTESTSTESTTEST{"timestamp":1481736578.1486158,"message":"(0.000462s) SELECT * FROM \"apps\" WHERE (\"apps\".\"guid\" = a7887e60-6535-4b27-af62-93b11f3f8701) LIMIT 1","log_level":"info","source":"cc.db","data":{"request_guid":"2f5c02f9-aaae-4052-4106-57dfe0548273::8d753b9f-953c-4613-9b7f-71aa9f0b4b49"},"thread_id":47186750830220,"fiber_id":47186770028100'
done

Additional observation: our remote TCP log endpoint (an ELK) is not stable but seems to reset connections quite often.

Using stemcell 3312.7 (rsyslogd 8.23.0) we see increasing memory consumption and the following message in /var/log/syslog:

Dec 15 14:22:11 localhost rsyslogd: action 'action 0' resumed (module 'builtin:omfwd') [v8.23.0 try http://www.rsyslog.com/e/2359 ]
Dec 15 14:22:11 localhost rsyslogd: action 'action 0' resumed (module 'builtin:omfwd') [v8.23.0 try http://www.rsyslog.com/e/2359 ]
Dec 15 14:22:11 localhost rsyslogd: action 'action 0' resumed (module 'builtin:omfwd') [v8.23.0 try http://www.rsyslog.com/e/2359 ]
Dec 15 14:22:11 localhost rsyslogd: action 'action 0' resumed (module 'builtin:omfwd') [v8.23.0 try http://www.rsyslog.com/e/2359 ]
Dec 15 14:22:11 localhost rsyslogd: action 'action 0' resumed (module 'builtin:omfwd') [v8.23.0 try http://www.rsyslog.com/e/2359 ]

On stemcell 3312 (rsyslogd 8.22.0), memory consumption is stable and we don't see the messages above in /var/log/syslog.

Therefore our conclusion is that there is a memory leak in rsyslog 8.23.0 when the remote syslog endpoint is not stable.

@tylerschultz

This comment has been minimized.

Copy link
Member

commented Dec 15, 2016

Thank you for this detailed investigation.
We're currently working to build os-images with rsyslog pinned to version 8.22.0-0adiscon1trusty1.

We see that a version 8.24 was recently release, but the Changelog for that version makes no mention of a fix for the issue.

@rgerhards

This comment has been minimized.

Copy link

commented Dec 20, 2016

Would it be possible to run, for a test, rsyslog under valgrind control (best in an interactive session)?

@cppforlife

This comment has been minimized.

Copy link
Member

commented Dec 20, 2016

@voelzmo do you think you can try valgrind in your environment?

@voelzmo

This comment has been minimized.

Copy link
Member Author

commented Jan 3, 2017

@rgerhards You mean with version 8.23 to get a better understanding of where the memory leak is coming from? I can ask out operations people to take a look at that. Any specific options you would like us to set for valgrind so it is most useful for you?

@holgero

This comment has been minimized.

Copy link

commented Jan 23, 2017

I attached the output of an experiment with valgrind on rsyslog 8.23 to the issue there.

@holgero

This comment has been minimized.

Copy link

commented Jan 24, 2017

What I have now to reproduce the problem on bosh-lite:

  • A bosh deployment of the syslog release 9 with the stemcell 3312.9 and haproxy from the cf release 249, here is the manifest: manifest.yml. Note that it won't deploy cleanly as the haproxy misses a pem file and doesn't start successfully.
  • Manual reconfiguration of the haproxy, here is the changed /var/vcap/jobs/haproxy/config/haproxy.conf. Start the haproxy manually on the vm with /var/vcap/jobs/haproxy/bin/haproxy_ctl start.

Then on that vm run a command to produce log entries: while : ; do logger hallo $RANDOM ; sleep .5 ; done & and watch the RES size of the rsyslogd increase slowly...

@jim80net

This comment has been minimized.

Copy link

commented Jun 30, 2017

Let me know if this is similar to what you're seeing.

rsyslog is configured to buffer, but in one of our environments, what happens is that rsyslog

vcap      4430  2.5  0.6 508316 13400 ?        Ssl  17:49   0:00 /usr/sbin/rsyslogd

tries to buffer to /var/vcap/sys/rsyslog/buffered, but it's owned by syslog:adm

6286cb9cbeca:~$ ls -dl /var/vcap/sys/rsyslog/buffered
drwxr-xr-x 2 syslog adm 4096 Jun 27 17:14 /var/vcap/sys/rsyslog/buffered

Resulting in permission errors:

[pid  2840] open("/var/vcap/sys/rsyslog/buffered/agg_backlog.00000001", O_WRONLY|O_CREAT|O_NOCTTY|O_CLOEXEC, 0600) = -1 EACCES (Permission denied)

The way to confirm that this is the case is to validate that the rsyslogd user is vcap, and that the /var/vcap/sys/rsyslog/buffered owner is syslog:adm.

@DenverOps

This comment has been minimized.

Copy link

commented Oct 10, 2017

What is the status of this issue? Has this been resolved?

@tylerschultz

This comment has been minimized.

Copy link
Member

commented Oct 11, 2017

@anEXPer

This comment has been minimized.

Copy link
Member

commented Feb 5, 2018

We'd like to propose releasing said version pin.

We (the team currently maintaining syslog-release need features only present in v8.25 and higher. Since rsyslog's stable version is up to v8.32 now, there's every chance that whatever issue originally provoked the pinning is gone. We attempted the replication described above in order to check, but, unfortunately, have concluded that the replication case is invalid.

The issue is that "slow increase in allocated memory" isn't necessarily a memory leak, especially if the memory allocation:

  • has bounded growth
  • is released according to business/configuration rules

This is what we believe was actually happening in the replication case using HA Proxy above. HA Proxy without a functioning backend accepts and then quickly kills TCP connections, never actually accepting data. When we ran the replication case for longer than ten minutes, the memory usage stabilized and held steady at around 1.4% of available memory. Rsyslog's configured to enqueue actions that haven't been completed in a linked list. If all the replication shows is the growth of action queues, we'd expect the memory reservation to drop as soon as a single good connection was made - and when we performed that experiment, that's exactly what we saw.

Not only that, but the issue "replicates" exactly the same way using the reported method with both the "before/unaffected" v8.22, the originally reported v8.23, and the latest v8.32. Which is good, since it reflects correct functioning of the program.

@davidelang

This comment has been minimized.

Copy link

commented Feb 5, 2018

@voelzmo

This comment has been minimized.

Copy link
Member Author

commented Jul 20, 2018

Let's just bump rsyslogd in the stemcells. We've recently had another outage, also related to an rsyslogd memory leak on version 8.22
So I guess it should be fine to just update and see what's happening. It most likely won't get worse than it already is.

@voelzmo

This comment has been minimized.

Copy link
Member Author

commented Jul 20, 2018

/cc @mfine30 Can you create a story to bump rsyslogd in the stemcells to latest stable?

@dpb587-pivotal

This comment has been minimized.

Copy link
Member

commented Jul 30, 2018

@mfine30 FYI, the team did some previous investigation in bumping rsyslog on the 3469.x through story #155528288. The intent was to let people test out that stemcell before bumping it everywhere, but I don't think we ended up getting much feedback on it.

For what it's worth, we're still installing this same rsyslog in the Xenial stemcell (which is probably bad)...

$ wget -qO- https://s3.amazonaws.com/bosh-aws-light-stemcells/light-bosh-stemcell-97-aws-xen-hvm-ubuntu-xenial-go_agent.tgz | tar -xOzf- packages.txt | grep rsyslog
ii  rsyslog                               8.22.0-0adiscon1trusty1                    amd64        a rocket-fast system for log processing
ii  rsyslog-gnutls                        8.22.0-0adiscon1trusty1                    amd64        TLS protocol support for rsyslog
ii  rsyslog-mmjsonparse                   8.22.0-0adiscon1trusty1                    amd64        Parsing/handling of CEE/Lumberjack JSON messages in rsyslog
ii  rsyslog-relp                          8.22.0-0adiscon1trusty1                    amd64        RELP protocol support for rsyslog

We may want to consider if this is something we should bump in a minor version, or if it warrants a full new version.

@voelzmo

This comment has been minimized.

Copy link
Member Author

commented Jul 31, 2018

Given that there hasn't been much feedback around rsyslogd in the past (except from our side, afaik), I'd say just bump it in a minor version. However, this is entirely your call. If you feel more comfortable with a major bump, go with that.

@dpb587-pivotal

This comment has been minimized.

Copy link
Member

commented Aug 6, 2018

In the backlog as Story #159533409

@jfmyers9

This comment has been minimized.

Copy link
Member

commented Aug 21, 2018

Closing this as we have unblocked the pinning of rsyslog. Please feel free to reopen if the memory leak is still an issue.

@jfmyers9 jfmyers9 closed this Aug 21, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.