TS-4897: Unbound growth of number of memory maps under SSL termination load when ssl_ticket_enabled=0 #1050
So it sounds like https://github.com/apache/trafficserver/blob/master/lib/ts/ink_memory.cc#L194-L201
If you look at the glibc implementation of I assume that you observe the effect when using the
I remember seeing the glibc implementation of it basically doing the We were also suspecting transparent huge pages but we can elaborate on that potential explanation if this one ends up having shortcomings. -When Calls to
Calls to
Even with the non-stop calls to This page (https://www.linux.com/manpage/man2/madvise.2.html) says the following about the
And the backtrace for the
@canselcik So if you replace One explanation for the difference in memory maps is that
I've done some further testing (including with
So basically it looks like you can call Any ideas? When we omit the call to either The fact that we are having the issue despite
Based on point 3, this sounds like a kernel bug. Which kernel version are you running?
Yeah, I'm starting to get that impression as well. Observed the issue under:
Both are RHEL so there might be some patches applied to them. I'm looking into getting the source for the build. Not that
I believe @zwoop has a patch for that somewhere.
A patch for ATS targeting some issues while running on 2.6.32 or a patch for making
Patch for making the advice configurable.
I believe that @smalenfant has seen this same thing and has also observed that upgrading kernels fixes it.
I've been doing lots of testing and sharing results with @PSUdaemon. I'm seeing the issue when using a fairly big ram_cache with normal pages under CentOS 6. I'm not using SSL at all. I've been troubleshooting this since I posted the
The test systems were multiple Dell R720/730xd servers with 24 disks and 192GB of memory.
Using CentOS 6.8
Then, trying to stop traffic server would take about an hour and there
Now, I then tested with CentOS 7 under the same conditions. I didn't see the CPU spikes and didn't experience the traffic server stop issue. To make this work with CentOS 6, I then configured Huge Pages to stop using anonymous pages. I've set the amount of reserved Huge Pages to 48000 (96GB) and gave it a surplus of 8000 pages (16GB).
You also have to enable Huge Pages
Reboot the server to ensure the reserved pages get allocated without being fragmented. You can then observe that the problem is gone (hopefully) and also that the PageTables size is kept to a reasonable size. (Was >200,000kB before).
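The reservation sizes and the `PageTables` figure mentioned above can be checked against `/proc/meminfo`. What follows is a minimal sketch, assuming a Linux host and the 2 MiB huge page size that is the x86-64 default (the comment does not state the page size). The helper names `parse_meminfo` and `hugepage_reservation_gib` are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {field: integer value}.

    Values keep the units the kernel prints: kB for sizes,
    plain counts for the HugePages_* fields.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def hugepage_reservation_gib(pages, page_kib=2048):
    """Size of `pages` huge pages in GiB.

    Assumption: 2 MiB (2048 KiB) pages unless told otherwise.
    """
    return pages * page_kib / (1024 * 1024)
```

On a live system, `parse_meminfo(open("/proc/meminfo").read())` exposes `PageTables`, `HugePages_Total` and `HugePages_Surp`. Under the 2 MiB assumption, 48000 pages come to 93.75 GiB and 8000 pages to about 15.6 GiB, which is consistent with the 96GB and 16GB figures if those count 2 MB per page in decimal units.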
Not sure how related this is to madvise, but this is the way to get it working and keep performance at a good level. Feel free to comment or ask questions.
This also started appearing once
@PSUdaemon I am yes. Variant of 5.3.2 with the HugePage + madvise patch.
FWIW, I have some crash logs on a 3.13 (Ubuntu) kernel that have an extremely large number of entries in
See also PR #1097
Should we close this now?
It's fixed elsewhere?
We feel this has been solved by #1097. Please open a new PR if you feel this is not the case.
The number of `[anon]` memory regions mapped to the `traffic_server` process displays unbound growth until the kernel thresholds are reached and the process is terminated.

This happens when ATS is used to terminate SSL and `ssl_ticket_enabled=0` in `ssl_multicert.config`.

We've experienced this issue on our staging and production hosts and were able to replicate it with the above configuration under high volume HTTPS load. We didn't experience this with `5.2.x` and it will make sense why at the end.

While generating `https` traffic with `siege` or `ab`, the issue can be observed with:

`watch "pmap $(pidof traffic_server) | wc -l"`
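
The same count can be approximated by reading `/proc/<pid>/maps` directly, which also makes it easy to separate anonymous mappings from file-backed ones. This is a minimal sketch, assuming a Linux host. The helper names `count_regions` and `read_maps` are hypothetical, and treating any mapping without a pathname as anonymous is an assumption about what `pmap` reports as `[anon]`.

```python
def count_regions(maps_text):
    """Return (total, anonymous) mapping counts for /proc/<pid>/maps content.

    Assumption: a line with only the five fixed columns (address, perms,
    offset, dev, inode) and no pathname is an anonymous mapping.
    """
    total = 0
    anonymous = 0
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        total += 1
        if len(fields) == 5:
            anonymous += 1
    return total, anonymous


def read_maps(pid):
    """Read the raw mapping table of a process (Linux only)."""
    with open(f"/proc/{pid}/maps") as handle:
        return handle.read()
```

Calling `count_regions(read_maps(pid))` periodically with the `traffic_server` PID should show the same trend as the `watch` command. The absolute numbers will differ slightly because `pmap | wc -l` also counts its header and total lines.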
`git bisect` pointed us to: [TS-3883: Fix madvise]

It turns out that a no-op `ats_madvise` hides the symptoms of the issue.

Going in deeper, we realized that the `ssl_ticket_enabled` option is relevant because after enabling the `ssl.session_cache` tag, we see that ATS doesn't manage its own session cache for SSL; it is done by the library instead. In that case, the code path doing the problematic allocation within ATS doesn't get executed often since OpenSSL takes care of the session tokens.

But why does this happen? It happens because `MADV_DONTDUMP` is passed to `posix_madvise` even though `MADV_DONTDUMP` is not a valid flag for `posix_madvise`, which is not a drop-in replacement for `madvise`.

Looking at `<bits/mman.h>`:

However `posix_madvise` takes:

Also, `posix_madvise` and `madvise` can both be present on the same system. However, they do not have the same capability. That's why the `Explicity exclude from the core dump, overrides the coredump filter bits` functionality isn't achievable through `posix_madvise`.

ASF JIRA Reference: https://issues.apache.org/jira/browse/TS-4897
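
To illustrate the distinction drawn above between `madvise` and `posix_madvise` advice values, here is a minimal sketch that only compares the constants. The `POSIX_MADV_*` numbers are hard-coded from glibc's Linux headers as an assumption for illustration (Python's standard library does not expose them), and `is_posix_advice` is a hypothetical helper. The sketch does not show what a given libc or kernel does when handed a value outside this set.

```python
import mmap

# Assumption: values as defined by glibc on Linux, hard-coded for illustration.
POSIX_MADV = {
    "POSIX_MADV_NORMAL": 0,
    "POSIX_MADV_RANDOM": 1,
    "POSIX_MADV_SEQUENTIAL": 2,
    "POSIX_MADV_WILLNEED": 3,
    "POSIX_MADV_DONTNEED": 4,
}

# mmap.MADV_DONTDUMP is available on Linux with Python 3.8+; fall back to
# the Linux value 16 where the attribute is missing.
MADV_DONTDUMP = getattr(mmap, "MADV_DONTDUMP", 16)


def is_posix_advice(advice):
    """True if `advice` is one of the values defined for posix_madvise."""
    return advice in POSIX_MADV.values()
```

With these values, `is_posix_advice(MADV_DONTDUMP)` is false: `MADV_DONTDUMP` lies outside the advice set that `posix_madvise` defines, which is the mismatch described above.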