
TS-4897: Unbound growth of number of memory maps under SSL termination load when ssl_ticket_enabled=0 #1050

Closed

Conversation

@canselcik (Contributor) commented Sep 27, 2016

The number of [anon] memory regions mapped into the traffic_server process grows without bound until the kernel's limits are reached and the process is terminated.

This happens when ATS is used to terminate SSL and ssl_ticket_enabled=0 in ssl_multicert.config.

We've experienced this issue on our staging and production hosts and were able to replicate it with the above configuration under high-volume HTTPS load. We didn't experience this with 5.2.x; the reason why becomes clear below.

While generating https traffic with siege or ab, the issue can be observed with:
watch "pmap $(pidof traffic_server) | wc -l"

git bisect pointed us to: [TS-3883: Fix madvise]

It turns out that a no-op ats_madvise hides the symptoms of the issue.

Digging deeper, we realized why the ssl_ticket_enabled option is relevant: after enabling the ssl.session_cache debug tag, we see that ATS doesn't manage its own SSL session cache; the library does it instead. In that case, the code path doing the problematic allocation within ATS doesn't get executed often, since OpenSSL takes care of the session tokens.

But why does this happen? Because MADV_DONTDUMP is passed to posix_madvise, even though MADV_DONTDUMP is not a valid advice value for posix_madvise: posix_madvise is not a drop-in replacement for madvise.

Looking at <bits/mman.h>:

    /* Advice to `madvise'.  */
    #ifdef __USE_BSD
    # define MADV_NORMAL      0     /* No further special treatment.  */
    # define MADV_RANDOM      1     /* Expect random page references.  */
    # define MADV_SEQUENTIAL  2     /* Expect sequential page references.  */
    # define MADV_WILLNEED    3     /* Will need these pages.  */
    # define MADV_DONTNEED    4     /* Don't need these pages.  */
    # define MADV_REMOVE      9     /* Remove these pages and resources.  */
    # define MADV_DONTFORK    10    /* Do not inherit across fork.  */
    # define MADV_DOFORK      11    /* Do inherit across fork.  */
    # define MADV_MERGEABLE   12    /* KSM may merge identical pages.  */
    # define MADV_UNMERGEABLE 13    /* KSM may not merge identical pages.  */
    # define MADV_DONTDUMP    16    /* Explicity exclude from the core dump,
                                       overrides the coredump filter bits.  */
    # define MADV_DODUMP      17    /* Clear the MADV_DONTDUMP flag.  */
    # define MADV_HWPOISON    100   /* Poison a page for testing.  */
    #endif

However posix_madvise takes:

    # define POSIX_MADV_NORMAL      0 /* No further special treatment.  */
    # define POSIX_MADV_RANDOM      1 /* Expect random page references.  */
    # define POSIX_MADV_SEQUENTIAL  2 /* Expect sequential page references.  */
    # define POSIX_MADV_WILLNEED    3 /* Will need these pages.  */
    # define POSIX_MADV_DONTNEED    4 /* Don't need these pages.  */

Also, posix_madvise and madvise can both be present on the same system, but they do not have the same capabilities. That's why the "Explicity exclude from the core dump, overrides the coredump filter bits" functionality isn't achievable through posix_madvise.
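
For illustration, here is a minimal standalone sketch (assuming Linux with glibc headers new enough to define MADV_DONTDUMP) contrasting the two calls:

    /* Sketch: contrast madvise() and posix_madvise() with MADV_DONTDUMP.
       Assumes Linux with glibc headers that define MADV_DONTDUMP. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 4 * (size_t)sysconf(_SC_PAGESIZE);
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Valid: MADV_DONTDUMP is Linux-specific advice that madvise()
           understands. */
        if (madvise(p, len, MADV_DONTDUMP) != 0)
            perror("madvise(MADV_DONTDUMP)");

        /* Questionable: 16 is outside the POSIX_MADV_* range, so what
           posix_madvise() does with it is implementation-defined. */
        int rc = posix_madvise(p, len, MADV_DONTDUMP);
        printf("posix_madvise(MADV_DONTDUMP) returned %d\n", rc);

        munmap(p, len);
        return 0;
    }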

ASF JIRA Reference: https://issues.apache.org/jira/browse/TS-4897

@PSUdaemon PSUdaemon self-assigned this Sep 27, 2016
@PSUdaemon PSUdaemon added this to the 7.1.0 milestone Sep 27, 2016
@PSUdaemon (Contributor) commented Sep 27, 2016

So it sounds like madvise() is a superset of posix_madvise(), meaning we should prefer madvise() if available? Perhaps we should alter ats_madvise() to flip the preference:

https://github.com/apache/trafficserver/blob/master/lib/ts/ink_memory.cc#L194-L201
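
A minimal sketch of that preference flip might look like this (the actual ats_madvise() lives in lib/ts/ink_memory.cc; the HAVE_MADVISE / HAVE_POSIX_MADVISE feature macros here are assumptions, not the real build flags):

    /* Sketch only: prefer madvise() so Linux-specific advice such as
       MADV_DONTDUMP is honored; fall back to posix_madvise() otherwise. */
    int
    ats_madvise(caddr_t addr, size_t len, int flags)
    {
    #if HAVE_MADVISE
      return madvise(addr, len, flags);
    #elif HAVE_POSIX_MADVISE
      return posix_madvise(addr, len, flags);
    #else
      (void)addr;
      (void)len;
      (void)flags;
      return 0;
    #endif
    }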

@zwoop zwoop added the TLS label Sep 27, 2016
@jpeach (Contributor) commented Sep 27, 2016

If you look at the glibc implementation of posix_madvise(), it just makes the madvise system call for the MADV_DONTDUMP advice.
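
For reference, a simplified paraphrase of glibc's Linux posix_madvise() (sysdeps/unix/sysv/linux/posix_madvise.c; the real code uses the internal syscall macros and varies by version):

    /* Simplified paraphrase; not the literal glibc source. */
    #include <errno.h>
    #include <sys/mman.h>

    int
    posix_madvise(void *addr, size_t len, int advice)
    {
      /* Linux's MADV_DONTNEED discards data, which POSIX_MADV_DONTNEED
         must not do, so glibc turns that advice into a no-op. */
      if (advice == POSIX_MADV_DONTNEED)
        return 0;

      /* Everything else is passed straight through to the madvise
         system call, returning the error number on failure. */
      return madvise(addr, len, advice) == 0 ? 0 : errno;
    }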

I assume that you observe the effect when using the SSLSessionCache implementation of the session cache? What happens when you use proxy.config.ssl.session_cache=2?

@canselcik (Contributor, Author) commented Sep 27, 2016

I remember seeing the glibc implementation of it basically doing the madvise syscall as well, but their behavior really does seem to be different on both our production and development systems, which are mostly RHEL 6.5 with some RHEL 6.7.

We also suspected transparent huge pages, but we can elaborate on that potential explanation if this one ends up having shortcomings.

  • When ats_madvise is a no-op, the map count doesn't grow unbounded (regardless of the ssl_ticket_enabled value).
  • When this change is made and posix_madvise is called with 0, the map count doesn't grow unbounded (regardless of the ssl_ticket_enabled value).
  • If ats_madvise calls madvise instead of posix_madvise (our systems have both), the map count doesn't grow unbounded.

Calls to ats_madvise when ssl_ticket_enabled=0:

ats_madvise(addr=0x7fc4bc022000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4b801f000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4cc01f000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4dc017000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4c8005000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4d41e5000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4a4003000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4b0019000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4a8013000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4a4006000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4e0022000, len=32768, flags=16) = 0
ats_madvise(addr=0x1dd7000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4c8028000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4c408b000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4b8003000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4b401c000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4d801d000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4b8028000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4b0022000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4a401e000, len=32768, flags=16) = 0
ats_madvise(addr=0x7fc4a0003000, len=8192, flags=0) = 0
ats_madvise(addr=0x7fc4d41ee000, len=32768, flags=16) = 0
<pattern repeats as long as more connections are accepted and handled>

Calls to ats_madvise when ssl_ticket_enabled=1 (not repeating even as more connections are accepted and handled):

ats_madvise(addr=0x7f05aea43000, len=131072, flags=16) = 0
ats_madvise(addr=0x7f057c003000, len=8192, flags=0) = 0
ats_madvise(addr=0x7f057c01e000, len=4096, flags=0) = 0
ats_madvise(addr=0x7f057c022000, len=20480, flags=0) = 0
ats_madvise(addr=0x7f057c028000, len=94208, flags=0) = 0
ats_madvise(addr=0x7f057c040000, len=28672, flags=0) = 0
ats_madvise(addr=0x7f058c02e000, len=32768, flags=0) = 0
ats_madvise(addr=0x7f05aea01000, len=262144, flags=0) = 0
ats_madvise(addr=0x7f05ae9bf000, len=262144, flags=16) = 0
ats_madvise(addr=0x7f05ae97d000, len=262144, flags=0) = 0
ats_madvise(addr=0x7f055c01d000, len=12288, flags=0) = 0
ats_madvise(addr=0x7f05ae93b000, len=262144, flags=16) = 0
ats_madvise(addr=0x7f05ae8f9000, len=262144, flags=16) = 0
ats_madvise(addr=0x7f05ae8b7000, len=262144, flags=16) = 0
ats_madvise(addr=0x7f055c021000, len=12288, flags=0) = 0

Even with the non-stop calls to ats_madvise in the ssl_ticket_enabled=0 case, if ats_madvise is a no-op, uses flags=0, or calls madvise rather than posix_madvise, we don't observe the growth in memory maps, while the internals of SSLSessionCache still do their usual allocations.

This page (https://www.linux.com/manpage/man2/madvise.2.html) says the following about the MADV_* flags:

    The following Linux-specific advice values have no counterparts in the
    POSIX-specified posix_madvise(3), and may or may not have counterparts
    in the madvise() interface available on other implementations. Note
    that some of these operations change the semantics of memory accesses.

And the backtrace for the ats_madvise(addr=*, len=32768, flags=16) call:

#0  ats_madvise (addr=0x7fffcc062000 "", len=32768, flags=16) at ink_memory.cc:205
#1  0x00007ffff7de8cc3 in freelist_new (f=0x1082800) at ink_queue.cc:215
#2  0x00007ffff7de86cd in ink_freelist_new (f=0x1082800) at ink_queue.cc:183
#3  0x000000000074860e in Allocator::alloc_void (this=0x8) at ../../lib/ts/Allocator.h:63
#4  IOBufferData::alloc (this=0x7fffcc006a70, type=MEMALIGNED, size_index=1) at ../../iocore/eventsystem/P_IOBuffer.h:274
#5  new_IOBufferData_internal (type=MEMALIGNED, size_index=1, loc=0x7c74e0 "memory/IOBuffer/SSLSessionCache.cc:121") at ../../iocore/eventsystem/P_IOBuffer.h:253
#6  IOBufferData_tracker::operator() (type=MEMALIGNED, size_index=1, this=<optimized out>) at ../../iocore/eventsystem/I_IOBuffer.h:1360
#7  SSLSessionBucket::insertSession (this=0x1465768, id=..., sess=0x7fffcc0008e0) at SSLSessionCache.cc:121
#8  0x0000000000748c63 in SSLSessionCache::insertSession (this=<optimized out>, sid=..., sess=0x7fffcc0008e0) at SSLSessionCache.cc:101
#9  0x0000000000723ffd in ssl_new_cached_session (ssl=0x7fffcc01ac60, sess=0x7fffcc0008e0) at SSLUtils.cc:249
#10 0x00007ffff7b70a32 in ssl_update_cache () from /export/apps/openssl/lib/libssl.so.1.0.0
#11 0x00007ffff7b4d885 in ssl3_accept () from /export/apps/openssl/lib/libssl.so.1.0.0
#12 0x0000000000722f6d in SSLAccept (ssl=0x7fffcc01ac60) at SSLUtils.cc:2117
#13 0x0000000000720036 in SSLNetVConnection::sslServerHandShakeEvent (this=0x7fff9c01b280, err=@0x7ffff4fedc3c: 0) at SSLNetVConnection.cc:1097
#14 0x0000000000720b2a in SSLNetVConnection::sslStartHandShake (this=0x7fff9c01b280, event=<optimized out>, err=@0x7ffff4fedc3c: 0) at SSLNetVConnection.cc:1008
#15 0x000000000071f2e1 in SSLNetVConnection::net_read_io (this=0x7fff9c01b280, nh=0x7ffff6002d70, lthread=0x7ffff5fff010) at SSLNetVConnection.cc:461
#16 0x000000000072d180 in NetHandler::mainNetEvent (this=0x7ffff6002d70, event=<optimized out>, e=<optimized out>) at UnixNet.cc:513
#17 0x0000000000759f2d in Continuation::handleEvent (data=0x112c7c0, event=5, this=<optimized out>) at I_Continuation.h:153
#18 EThread::process_event (this=0x7ffff5fff010, e=0x112c7c0, calling_code=5) at UnixEThread.cc:148
#19 0x000000000075a879 in EThread::execute (this=0x7ffff5fff010) at UnixEThread.cc:275
#20 0x000000000075936a in spawn_thread_internal (a=0x1087f20) at Thread.cc:86
#21 0x0000003ee0e079d1 in start_thread () from /lib64/libpthread.so.0
#22 0x0000003ee0ae88fd in clone () from /lib64/libc.so.6

@jpeach (Contributor) commented Sep 27, 2016

@canselcik So if you replace posix_madvise with madvise, whilst retaining the MADV_DONTDUMP flag, you observed no increase in memory maps?

One explanation for the difference in memory maps is that when IOBufferData::alloc() does not madvise(), it misses the freelist-allocated sizes. However, these sizes are larger than the lengths in your trace, so that's a conundrum. I think a system call trace might be more illuminating?

@PSUdaemon PSUdaemon assigned jpeach and unassigned PSUdaemon Sep 27, 2016
@canselcik (Contributor, Author) commented Sep 28, 2016

@jpeach
@PSUdaemon

I've done some further testing (including with strace), and here are some corrections and clarifications:

  1. It doesn't matter if ats_madvise calls posix_madvise or madvise.
  2. If posix_madvise is called with the MADV_DONTDUMP flag, it ultimately makes a madvise syscall with MADV_NORMAL: essentially, it sanitizes any flag greater than 4 down to MADV_NORMAL. In this case we see the growth in memory maps.
  3. If madvise is called with the MADV_DONTDUMP flag, it ultimately makes a syscall to madvise with MADV_DONTDUMP, and we see the growth in memory maps.

So basically it looks like you can call either posix_madvise or madvise with MADV_DONTDUMP and this issue surfaces.

Any ideas? When we omit the call to posix_madvise / madvise entirely, or make the call with MADV_NORMAL rather than MADV_DONTDUMP, we don't see the issue.

The fact that we see the issue even though posix_madvise with MADV_DONTDUMP is translated into madvise with MADV_NORMAL means there must be something else done with that flag before the syscall is ultimately made with the sanitized value.
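
For illustration, the sanitizing behavior from point 2 would amount to something like this (a hypothetical reconstruction of what we observed on the RHEL 6 libc, with an invented function name; not actual library source):

    #define _GNU_SOURCE
    #include <sys/mman.h>

    /* Hypothetical reconstruction of the observed RHEL 6 behavior:
       advice values above POSIX_MADV_DONTNEED (4) are sanitized to
       MADV_NORMAL (0) before the madvise syscall is made. */
    static int
    posix_madvise_as_observed(void *addr, size_t len, int advice)
    {
        if (advice > POSIX_MADV_DONTNEED)
            advice = MADV_NORMAL;
        return madvise(addr, len, advice);
    }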

@PSUdaemon (Contributor)

Based on point 3, this sounds like a kernel bug. Which kernel version are you running?

@canselcik (Contributor, Author) commented Sep 29, 2016

Yeah, I'm starting to get that impression as well.

Observed the issue under:

  • 2.6.32-504.el6.x86_64
  • 2.6.32-573.7.1.el6.x86_64

Both are RHEL so there might be some patches applied to them. I'm looking into getting the source for the build.

Not that 2.6.32 isn't ancient, but if this behavior is present with more modern kernel versions, perhaps we could make the MADV_DONTDUMP behavior configurable?

@PSUdaemon (Contributor)

I believe @zwoop has a patch for that somewhere.

@canselcik (Contributor, Author)

A patch for ATS targeting some issues while running on 2.6.32, or a patch for making MADV_DONTDUMP configurable?

@PSUdaemon (Contributor)

Patch for making the advice configurable.

@PSUdaemon (Contributor)

I believe that @smalenfant has seen this same thing and has also observed that upgrading kernels fixes it.

@smalenfant

I've been doing lots of testing and sharing results with @PSUdaemon. I'm seeing the issue when using a fairly big ram_cache with normal pages under CentOS 6. I'm not using SSL at all. I've been troubleshooting this since I posted about ATS blocking at half-hour intervals in the traffic server forum.

Test systems were multiple Dell R720/730xd machines with 24 disks and 192 GB of memory.

  • CONFIG proxy.config.cache.ram_cache.size INT 103079215104 (= 96 GB)

Using CentOS 6.8 with kernel 2.6.32-642.4.2 (the exact version doesn't matter; any variant of CentOS 6.7 or 6.8), I would run HLS streaming to fill up the entire ram_cache. At some point (after a few hours) I would start to see system CPU spikes every 22 minutes. Running perf top during that time shows high activity in try_to_unmap_anon.

Then, stopping traffic server would take about an hour, and try_to_unmap_anon would show up the entire time.

I then tested with CentOS 7 under the same conditions: I didn't see the CPU spikes and didn't experience the traffic server stop issue.

To make this work with CentOS 6, I configured huge pages to stop using anonymous pages: I set the number of reserved huge pages to 48000 (96 GB) and gave it a surplus of 8000 pages (16 GB).

vm.nr_hugepages = 48000
vm.nr_overcommit_hugepages = 8000

You also have to enable huge pages in ATS with CONFIG proxy.config.allocator.hugepages INT 1.

Reboot the server to ensure the reserved pages get allocated without being fragmented.

You can then observe that the problem is (hopefully) gone and that PageTables stays at a reasonable size (it was >200,000 kB before).

[root@myserver ~]# cat /proc/meminfo | grep Page
AnonPages:       1176352 kB
PageTables:         9836 kB
AnonHugePages:         0 kB
HugePages_Total:   48651
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:      651

Not sure how related this is to madvise, but this is the way to get it working and keep performance at a good level. Feel free to comment or ask questions.

@PSUdaemon (Contributor)

This also started appearing once madvise() was added. You are using a version of 5.3 with it backported, correct?

@smalenfant

@PSUdaemon I am, yes. A variant of 5.3.2 with the HugePage + madvise patch.

@jpeach (Contributor) commented Oct 4, 2016

FWIW, I have some crash logs from a 3.13 (Ubuntu) kernel that have an extremely large number of entries in /proc/%PID/maps. Maybe this problem is not limited to 2.6 kernels?

@jpeach (Contributor) commented Oct 12, 2016

See also PR #1097

@PSUdaemon (Contributor)

Should we close this now?

@zwoop (Contributor) commented Dec 1, 2016

It's fixed elsewhere?

@PSUdaemon (Contributor)

@zwoop: #1097 addresses it as best I think we can.

@PSUdaemon PSUdaemon self-requested a review December 22, 2016 16:51
@PSUdaemon PSUdaemon removed this from the 7.1.0 milestone Dec 22, 2016
@PSUdaemon (Contributor)

We feel this has been solved by #1097. Please open a new PR if you feel this is not the case.

@PSUdaemon PSUdaemon closed this Dec 22, 2016