Optimize epoll_wait loop to avoid memory churn. #80
Conversation
We're investigating using NuProcess in Bitbucket Server, so I put together a simple jmh benchmark to try and get a feel for its performance when pumping a large amount of data from a process (serving Git hosting operations can require pumping an enormous amount of data). My benchmark was very simple:

public class NuProcessBenchmark {
public static void main(String[] args) throws Exception {
System.setProperty("com.zaxxer.nuprocess.threads", "cores");
Options options = new OptionsBuilder()
.addProfiler(GCProfiler.class)
//.forks(0) //Set when running with JProfiler
.forks(2)
.jvmArgsAppend("-Dcom.zaxxer.nuprocess.threads=cores")
.include(NuProcessBenchmark.class.getSimpleName())
//.measurementIterations(20) // Set when running with JProfiler
.measurementIterations(50)
//.threads(5) // Set when running with JProfiler
.threads(10)
.timeUnit(TimeUnit.SECONDS)
.warmupIterations(1)
.build();
new Runner(options).run();
}
@Benchmark
@Test
public void cat() throws Exception {
NuProcessBuilder builder = new NuProcessBuilder("cat", "/tmp/random.dat");
builder.setProcessListener(new NuAbstractProcessHandler() {
private final CRC32 crc32 = new CRC32();
@Override
public void onExit(int statusCode) {
assertEquals(3407529827L, crc32.getValue());
}
@Override
public void onStdout(ByteBuffer buffer, boolean closed) {
crc32.update(buffer);
}
});
NuProcess process = builder.start();
assertEquals(0, process.waitFor(1L, TimeUnit.DAYS));
}
}

When running the benchmark, I noticed that the GC churn appeared to be extremely high, triggering heavy GC (jmh showed ~34MB/s churn, compared to <200KB/s churn from a basically identical benchmark using our existing processutils library). Running JProfiler (my benchmark above has some comments with the values I used when profiling) and analyzing allocation hotspots showed JNA is the primary culprit, triggering well over 500MB in allocations. With the changes above, profiling the same benchmark now shows a much lower allocation rate.
This reduces the reported churn rate from 34MB/s to 14MB/s, which results in significantly fewer GC runs. The churn rate is extremely important to Bitbucket Server because we run an enormous number of Git processes and, on larger servers, can easily have over 100 running concurrently.

(Edit: For reference, the benchmark was run on a Linux Mint 18.3 x86_64 VM, with 8 VCPUs (AMD Threadripper 1950X) and 16GB of RAM, using Oracle Java 8u151 and Maven 3.3.9.)
// per bits/epoll.h, epoll_event is created with __attribute__ ((__packed__)), which disables
// applying padding to optimize alignment. epoll_event is memory-aligned on 32-bit platforms,
// but not on 64-bit platforms (i.e. it uses 32-bit alignment on 64-bit platforms)
super(ALIGN_NONE);
I'm still doing some testing on 32-bit Linux to further verify this change
Let us know when you've finished 32-bit verification.
Sorry, took a little bit of setup (I didn't have a 32-bit VM on hand) but I've just finished up testing it now. All of the unit tests pass on x86 Ubuntu 14.04. Once I've added the benchmark I wrote to the project I'll run that on my 32-bit environment as well for a little more stress testing.
Details for my 32-bit environment:
bturner@ubuntu:~/Development/oss/nuprocess$ uname -a
Linux ubuntu 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:06:37 UTC 2016 i686 athlon i686 GNU/Linux
bturner@ubuntu:~/Development/oss/nuprocess$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i686-linux-gnu/4.8/lto-wrapper
Target: i686-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.4-2ubuntu1~14.04.3' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-i386/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-i386 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-i386 --with-arch-directory=i386 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-targets=all --enable-multiarch --disable-werror --with-arch-32=i686 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=i686-linux-gnu --host=i686-linux-gnu --target=i686-linux-gnu
Thread model: posix
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
A slightly modified version of my sizeof program from #79 confirms (as expected) that sizeof(struct epoll_event) is 12 on x86 Linux as well.
I've run my jmh benchmark on my 32-bit VM now. Here's the summary:
# Run complete. Total time: 00:04:07
Benchmark Mode Cnt Score Error Units
NuProcessBenchmark.cat thrpt 100 20.496 ± 0.486 ops/s
NuProcessBenchmark.cat:·gc.alloc.rate thrpt 100 7.703 ± 0.183 MB/sec
NuProcessBenchmark.cat:·gc.alloc.rate.norm thrpt 100 540022.366 ± 472.536 B/op
NuProcessBenchmark.cat:·gc.churn.Metaspace thrpt 100 ≈ 10⁻⁴ MB/sec
NuProcessBenchmark.cat:·gc.churn.Metaspace.norm thrpt 100 2.786 ± 6.647 B/op
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space thrpt 100 6.662 ± 4.691 MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm thrpt 100 518120.933 ± 367662.475 B/op
NuProcessBenchmark.cat:·gc.churn.PS_Old_Gen thrpt 100 0.520 ± 0.884 MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Old_Gen.norm thrpt 100 40171.064 ± 67837.749 B/op
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space thrpt 100 0.057 ± 0.075 MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm thrpt 100 4530.189 ± 5891.110 B/op
NuProcessBenchmark.cat:·gc.count thrpt 100 29.000 counts
NuProcessBenchmark.cat:·gc.time thrpt 100 4980.000 ms
NuProcessBenchmark.cat:·threads.alive thrpt 100 18.460 ± 0.170 threads
NuProcessBenchmark.cat:·threads.daemon thrpt 100 17.460 ± 0.170 threads
NuProcessBenchmark.cat:·threads.started thrpt 100 19.000 threads
20 ops/sec, given each op is a complete cat of a 250MB test file, translates to a throughput of roughly 5GB/s (using the default 4 pump threads, given my VM has 8 VCPUs). Not bad.
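As a sanity check, the aggregate-throughput figure above is just ops/sec multiplied by bytes per op. A trivial sketch (class and method names are mine, purely illustrative):

```java
// Back-of-envelope check of the throughput claim: 20 complete cats per
// second of a 250MB file is ~5GB/s in aggregate across all benchmark threads.
final class ThroughputMath {
    static double aggregateGbPerSec(double opsPerSec, double megabytesPerOp) {
        // 1000 MB per GB (decimal units, matching the text's rough figures)
        return opsPerSec * megabytesPerOp / 1000.0;
    }
}
```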
Not bad. If you need higher throughput, you can experiment with calling fcntl with F_SETPIPE_SZ on the pipe. The Linux default pipe size is 64K, but it can be raised from user-space (via fcntl) up to 1MB. There's probably a sweet spot in there, maybe between 128K and 256K.
Of course you would need to adjust the NuProcess buffer sizes to match.
Actually, thinking about it, something like “cat” pretty much produces constant output, but a git command’s output is likely to be quite “bursty”. A burst of output larger than 64K will stall the git process until NuProcess can drain the pipe. So setting a larger pipe size might have good payoff there.
Git's an interesting case, because it doesn't behave quite how one might expect. For hosting operations (which, for obvious reasons, is what I care about the most here) it tends to reach a "steady state" fairly quickly. After the ref advertisement, it generates a pack with the requested objects and from there it's just streaming that pack. The wire protocol uses a packet format, which is bounded to 8K. That means it writes at most a shade over 8K (8197 bytes) at a time.
We support two different transports, HTTP (Tomcat) and SSH (Mina SSHD). Tomcat uses an 8K buffer by default for HttpServletResponse, so we use an 8K buffer when copying streams. For SSH, specifically because of Git's packet sizes, we're using an exact 8197-byte buffer to copy streams. These buffers effectively carry any backpressure from the socket through to the underlying Git process. If the socket can't move the data fast enough, the Git process gets blocked trying to write more output and slows down to match.
Of course, as the pump threads fall behind (which they almost certainly would, since they'd be marshaling over to a socket of some sort), while Git might not be writing more than ~8K per shot, that doesn't mean there can't be multiple packets waiting for the pump thread to read. But since those larger reads are still going to be funneled through a buffer somewhere around 8K on their way to the client, those larger reads may not offer much benefit (and, in fact, doing smaller, more packet-sized reads might allow the pump thread to dump output to more sockets per second, keeping perceived transfer rates more steady).
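The backpressure-carrying copy loop described above can be sketched roughly like this (a hedged illustration; the class, method, and constant names are mine, not Bitbucket Server's actual code — only the 8197-byte figure comes from the text):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Sketch of a packet-sized stream copy: a fixed buffer sized to Git's maximum
// packet (8197 bytes, per the discussion above) moves data from the process to
// the client. Because each write blocks until the sink accepts the data, a
// slow socket slows the read side, and ultimately the Git process, to match.
final class PacketSizedCopy {
    private static final int GIT_PACKET_SIZE = 8197;

    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[GIT_PACKET_SIZE];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read); // blocking write carries backpressure
            total += read;
        }
        return total;
    }
}
```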
Setting hosting aside, a lot of what Bitbucket Server spends the rest of its time doing is running git rev-list, git diff-tree and git diff, with a few other commands interwoven here and there. For those commands, it's entirely possible they won't produce 64K of output over their entire execution, let alone in a single shot.
If anything, then, I think what we'd really want is a way to change NuProcess.BUFFER_CAPACITY from a hard constant to something that can, at least to some extent, be changed at runtime. The most flexible option would probably be to make it configurable on the NuProcessBuilder and use NuProcess.BUFFER_CAPACITY as a default, but even making it configurable via system property would let the application do some tuning.
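A system-property-based approach might look something like the following sketch. The property name and the resolution logic are my own assumptions, not NuProcess API; only the 64K default reflects the existing hard constant:

```java
// Hypothetical sketch: resolving a buffer capacity from a system property,
// falling back to the existing hard-coded default on absence or bad input.
final class BufferCapacityConfig {
    // NuProcess.BUFFER_CAPACITY's current value, per the discussion (64K)
    private static final int DEFAULT_CAPACITY = 65536;

    static int resolve() {
        String value = System.getProperty("com.zaxxer.nuprocess.bufferCapacity");
        if (value == null) {
            return DEFAULT_CAPACITY;
        }
        try {
            int parsed = Integer.parseInt(value.trim());
            return parsed > 0 ? parsed : DEFAULT_CAPACITY;
        } catch (NumberFormatException e) {
            return DEFAULT_CAPACITY; // ignore malformed values
        }
    }
}
```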
// technically this union should have a "Pointer ptr" field, but, for how EpollData is actually
// used, only referencing the "fd" field, it's nothing but overhead. JNA will end up constructing
// them as part of ProcessEpoll's execution, but they never get used
//public Pointer ptr;
Removing this completely avoids Pointer allocations in ProcessEpoll.process.
public static final int EPOLL_CTL_MOD = 3; /* Remove a file decriptor from the interface. */
public static final int EPOLL_CTL_ADD = 1; /* Add a file descriptor to the interface. */
public static final int EPOLL_CTL_DEL = 2; /* Remove a file descriptor from the interface. */
public static final int EPOLL_CTL_MOD = 3; /* Change file descriptor epoll_event structure. */
The comments on EPOLL_CTL_MOD and EPOLL_CTL_DEL were backwards, and all three had "descriptor" misspelled.
@@ -34,21 +34,21 @@

public static native int epoll_ctl(int epfd, int op, int fd, EpollEvent event);

// technically, events should be an array, but we only ever call with maxevents = 1
Even to use an array, when using direct binding (Native.register) the parameter type would still be EpollEvent, not EpollEvent[]. Array parameters are not supported with direct binding. Instead, to allow maxevents > 1, you'd allocate an EpollEvent[] using toArray and then pass the first element as events.
eventPool.add(new EpollEvent());
EpollEvent event = new EpollEvent();
// disable auto-read for these events, because they're only used to call epoll_ctl
event.setAutoRead(false);
This reduces SharedMemory allocations in registerProcess when (needlessly) copying from native memory to heap memory after calling epoll_ctl.
deadPool = new LinkedList<LinuxProcess>();
// disable auto-read and auto-write for triggeredEvent; process() will explicitly call read()
// only if epoll_wait returns 1, and only call write() before invoking epoll_ctl
triggeredEvent.setAutoSynch(false);
This reduces SharedMemory allocations in process:
- The contents of the structure aren't read by epoll_wait, only written to, so they're irrelevant and a write() is nothing but overhead
- If epoll_wait returns <1, there's no point in reading the native memory back onto the heap
- Native memory is explicitly written prior to calling epoll_ctl to register a continued interest in stdin availability, but is not read back after the call
@@ -178,9 +184,9 @@ public void closeStdin(LinuxProcess process)
@Override
public boolean process()
{
Integer stdinFd = null;
Integer stdoutFd = null;
Integer stderrFd = null;
Using Integer and doing a null check allocates 3 Integers per (very tightly looped) process call. Individually, they don't matter much. But when pumping a large amount of data across a large number of processes, they contribute noticeably to GC churn. Using an int with a marker value helps avoid those allocations on many process calls, but not on all of them:
- ident is still boxed to call fildesToProcessMap.get
- All three descriptors are boxed if the process becomes eligible for soft-exit cleanup
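The sentinel-value pattern described here can be sketched as follows (names are illustrative, not the actual ProcessEpoll fields; the marker value Integer.MIN_VALUE matches the PR's change):

```java
// Sketch of replacing nullable Integer locals with primitive ints plus a
// marker value, so the hot path compares ints instead of boxing and
// null-checking Integers on every process() iteration.
final class FdClassifier {
    static final int NO_FD = Integer.MIN_VALUE; // marker: descriptor not yet acquired

    static String classify(int ident, int stdinFd, int stdoutFd, int stderrFd) {
        // plain int comparisons: no Integer allocation per call
        if (stdinFd != NO_FD && ident == stdinFd) return "stdin";
        if (stdoutFd != NO_FD && ident == stdoutFd) return "stdout";
        if (stderrFd != NO_FD && ident == stderrFd) return "stderr";
        return "unknown";
    }
}
```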
linuxProcess.getStderr().release();
}
}
triggeredEvent.clear();
This is a "cleaner" approach, but the contents of triggeredEvent at the end of the call don't matter. It's guaranteed the next call will be to epoll_wait, which will not read those contents; it'll only overwrite them if a descriptor is ready.
checkDeadPool();
}
}

// ************************************************************************
// Private methods
// ************************************************************************
AtomicInteger count = new AtomicInteger();
Unused
I'm not sure about the Travis failure. That doesn't look like anything I broke. Feel free to ignore my
(Also, I guess I should note that this will fix #79)
I love these changes, thanks! It looks like travis-ci no longer supports "oraclejdk7". Would you mind changing the .travis.yml to point at "openjdk7" instead so that we can get the build green? Everything looks like it's fine on my local box here.
@lfbayer Thanks for the speedy review! Happy to tweak the Travis config
Looks like |
It definitely has nothing to do with any of your changes. The internets indicate that travis-ci does not give CAP_SYS_ADMIN, which would mean that this call would fail. But it isn't clear to me how the build ever succeeded if that is the case. I'm guessing now that it must be something that changed in travis-ci. But it's strange to me that we even require additional permissions. Unfortunately I'm not really familiar with unshare or CLONE_FS; maybe we're actually using it wrong. It looks like @brettwooldridge is committing some changes that he was sitting on. I've chatted briefly about this issue with him and there are a couple things we will investigate.
}
else if ((events & LibEpoll.EPOLLOUT) != 0) // Room in stdin pipe available to write
{
} else if ((events & LibEpoll.EPOLLOUT) != 0) { // Room in stdin pipe available to write
NuProcess code-style is close-curly on a line by itself.
I thought I'd reverted all the places I changed this. Will fix
If you rebase onto master your travis-ci build should work now (you're likely to have a conflict in .travis.yml).
@bturner Finally, can you create a "benchmarks" directory and include your jmh benchmark?
- Disabled autoRead and/or autoWrite on EpollEvent structs depending on how they're used, in ProcessEpoll
  - EpollEvents in the "eventPool" have autoRead disabled because they are used in a write-only context (calling epoll_ctl)
  - The "triggeredEvent" EpollEvent has both autoRead/autoWrite disabled
    - When epoll_wait returns 1, read() is called explicitly, but if it returns 0 native memory is not read
    - Prior to calling epoll_ctl to register for further stdin events, write() is called explicitly
- Fixed memory alignment on EpollEvent and added a unit test to verify the size of the struct is 12 bytes
- Removed EpollEvent.EpollData.ptr because, even though it's never used, JNA still allocates it
- Replaced Integers with ints using Integer.MIN_VALUE as a marker, when acquiring descriptors, to avoid allocating 3 per process() loop
  - If the process soft-exits, Integers will still be created then

Extra bits:
- Added IntelliJ IDEA files to .gitignore
- Updated JNA to 4.5.1

Force-pushed a9ee0ed to 2494eb6
@brettwooldridge Sure, happy to add that in. Would you like it in a Maven profile to keep it out of the main build?
@bturner If you have no further changes, I am prepared to merge this.
Sorry, had to step away to help put my son to bed. I'm just finishing up adding the benchmark. I can open a separate pull request for it if you prefer.
@bturner No it's fine, no hurry, just wait until the pull request is ready.
- NuProcessBenchmark takes a given test file and cats it to stdout, computing a CRC32 checksum to test for corruption
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.7.0</version>
I had updated plugins in the main pom.xml as well, but to try and keep this change more focused I reverted those updates. I'm happy to push them separately, if you'd like.

Note that I need to use at least 3.5 to use annotationProcessorPaths in configuration.
* That will display the CRC32 checksum in hex, which can then be converted to decimal to match {@code CRC32}'s
* {@code getValue()}. (The calculator on most operating systems can convert hex to decimal.)
*/
public class NuProcessBenchmark {
I wish that there was a way to make this a bit more self-contained. I had thought about just having it directly fork head -c 250MB /dev/urandom, but that relies on the system's ability to generate sufficient entropy. Otherwise it gets extremely slow, which would skew the benchmark. (It also makes it impossible to perform any sort of quick validation on the output to test for corruption, since the checksum would continuously change.)
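One way to make the benchmark self-contained without depending on system entropy would be to generate the test data from a fixed seed, so the CRC32 stays stable across runs. A hedged sketch (class name, seed, and file size are illustrative choices, not part of the actual benchmark):

```java
import java.util.Random;
import java.util.zip.CRC32;

// Sketch: deterministic pseudo-random data from a fixed seed yields a stable
// CRC32, allowing corruption checks without shipping a large test file or
// draining /dev/urandom.
final class TestDataGenerator {
    static long generateChecksum(long seed, int totalBytes) {
        Random random = new Random(seed); // fixed seed => reproducible bytes
        CRC32 crc32 = new CRC32();
        byte[] chunk = new byte[8192];
        int remaining = totalBytes;
        while (remaining > 0) {
            random.nextBytes(chunk);
            int n = Math.min(chunk.length, remaining);
            crc32.update(chunk, 0, n);
            remaining -= n;
        }
        return crc32.getValue();
    }
}
```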
@@ -43,7 +44,7 @@
<name>Rich DiCroce</name>
</developer>
<developer>
<name>Chris Cladden</name>
<name>Chris Cladden</name>
Just fixed some whitespace here
*
* @param force if <code>true</code> is passed, STDIN will be immediately closed
*              even if there are pending writes; otherwise, it will be closed
*              after all pending writes are completed
I noticed a Javadoc warning about this being missing
Sorry for the delay, @brettwooldridge and @lfbayer . I got hung up on some details that I couldn't get quite how I wanted them. For my benchmark, it requires Java 8 because it uses
It's fine if the benchmark requires Java 8.
@bturner Check out these memory results after my latest refactor:
Compared to previous (on my Linux VM):
I profiled a jmh run with JProfiler ... @bturner I'd be interested to hear how this affects your throughput numbers.
@bturner Also fixed up the benchmark to self-calculate the CRC32.
@bturner The moral of the story seems to be that JNA is extremely inefficient by default, but can be efficient if used carefully. Structures are extremely inefficient, with every field access throwing off ... And (possibly biggest impact) passing a ...
That's an epic improvement, @brettwooldridge! Sensational. My benchmark run now shows:
(Note that while the name is the same, this isn't quite the same benchmark as the one in the repository. For consistency I need to keep using my existing benchmark, otherwise this result wouldn't really be a relevant comparison for me.) The throughput shows a marginal improvement, about 2 ops/s, but big enough to be statistically significant; the real win is definitely in that new allocation rate. As a potentially relevant aside, this is the result of a similar (identical within my ability to make it so) benchmark run against

The allocation rates between the two are really close now, suggesting a transition over to JNR might not really have much "win" to it, from that perspective, now that NuProcess's JNA approach has been so aggressively tuned.
@bturner Thanks for the feedback! Very encouraging. As I think about the allocation rate, I thought about pooling. Currently, when a process is started, a new instance of ... NuProcess could likely benefit from a pooling scheme for these buffers, whereby when the process ends, the buffers are returned to a pool, to be picked up by the next process run. A pool could be associated with each processing thread, which would have two benefits. First, because the pool would only be accessed by a single thread, it could be non-concurrent/lock-free. Second, processing threads have a finite lifetime, so if no processes are run for 2.5s (default) the thread terminates and the pool would go with it, allowing memory to be given back during idle periods. During busy periods, this would result in a near-zero allocation rate, except for a handful of minor objects; the buffers are clearly the bulk of the memory overhead of a process.
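The per-thread pooling idea described here could be sketched with a ThreadLocal deque. This is a hedged illustration of the design, not NuProcess code; the class name and capacity are assumptions:

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Sketch: a buffer pool confined to each processing thread. Thread confinement
// means no locking is needed, and because the pool lives in a ThreadLocal, it
// dies with the (finite-lifetime) pump thread, releasing memory when idle.
final class ThreadLocalBufferPool {
    private static final int BUFFER_CAPACITY = 65536; // illustrative size

    private static final ThreadLocal<ArrayDeque<ByteBuffer>> POOL =
            ThreadLocal.withInitial(ArrayDeque::new);

    static ByteBuffer acquire() {
        ByteBuffer buffer = POOL.get().poll();
        // allocate only when the pool is empty; otherwise reuse
        return buffer != null ? buffer : ByteBuffer.allocateDirect(BUFFER_CAPACITY);
    }

    static void release(ByteBuffer buffer) {
        buffer.clear();            // reset position/limit for the next process
        POOL.get().push(buffer);   // return to this thread's pool
    }
}
```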
I burned some midnight oil and implemented pooling of Memory/ByteBuffer objects. Much to my astonishment, the benchmark showed almost no difference at all in terms of allocation rate. Either JNA or the JVM itself is playing a game behind the scenes to reuse the objects in the non-pooling case. I'd like to measure one of the non-jmh tests with some of the heap analysis tools to eliminate jmh's own allocation from the picture.
- Disabled autoRead and/or autoWrite on EpollEvent structs depending on how they're used, in ProcessEpoll
  - EpollEvents in the eventPool have autoRead disabled because they are used in a write-only context (calling epoll_ctl)
  - The triggeredEvent EpollEvent has both autoRead/autoWrite disabled
    - When epoll_wait returns 1, read() is called explicitly, but if it returns 0 native memory is not read
    - Prior to calling epoll_ctl to register for further stdin events, write() is called explicitly
- Fixed memory alignment on EpollEvent and added a unit test to verify the size of the struct is 12 bytes
- Removed EpollEvent.EpollData.ptr because, even though it's never used, JNA still allocates it
- Replaced Integers with ints using Integer.MIN_VALUE as a marker, when acquiring descriptors, to avoid allocating 3 per process() loop
  - If the process soft-exits, Integers will still be created then

Extra bits:
- Added IntelliJ IDEA files to .gitignore
- Updated JNA to 4.5.1