
Optimize epoll_wait loop to avoid memory churn. #80

Merged

Conversation

@bturner (Collaborator) commented Jan 12, 2018

  • Disabled autoRead and/or autoWrite on EpollEvent structs in ProcessEpoll, depending on how they're used (see the sketch at the end of this description)
    • EpollEvents in the eventPool have autoRead disabled because they are used in a write-only context (calling epoll_ctl)
    • The triggeredEvent EpollEvent has both autoRead and autoWrite disabled
      • When epoll_wait returns 1, read() is called explicitly; if it returns 0, native memory is not read
      • Prior to calling epoll_ctl to register for further stdin events, write() is called explicitly
  • Fixed memory alignment on EpollEvent and added a unit test to verify the size of the struct is 12 bytes
  • Removed EpollEvent.EpollData.ptr because, even though it's never used, JNA still allocates it
  • Replaced Integers with ints, using Integer.MIN_VALUE as a marker when acquiring descriptors, to avoid allocating 3 per process() loop
    • If the process soft-exits, Integers will still be created then

Extra bits:

  • Added IntelliJ IDEA files to .gitignore
  • Updated JNA to 4.5.1
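
For reference, here is a minimal sketch of the JNA mechanics these bullets rely on. The class and field names are illustrative (the real EpollEvent lives in com.zaxxer.nuprocess.linux and nests an epoll_data union); the point is the ALIGN_NONE packing plus explicit read()/write() calls replacing JNA's per-call auto-synch.

import java.util.Arrays;
import java.util.List;

import com.sun.jna.Structure;

// Sketch of a packed, explicitly-synched JNA struct. By default JNA copies the whole Structure
// between Java and native memory around every native call; turning that off and calling
// read()/write() by hand is what removes the churn described above.
public class PackedEpollEventSketch extends Structure {
    public int events;  // EPOLLIN, EPOLLOUT, ...
    public long data;   // stands in for the 8-byte epoll_data union; only its fd member is used

    public PackedEpollEventSketch() {
        super(ALIGN_NONE);   // struct epoll_event is __packed__ on Linux: 12 bytes, no padding
        setAutoRead(false);  // skip native -> Java copies; call read() only when epoll_wait returns 1
        setAutoWrite(false); // skip Java -> native copies; call write() just before epoll_ctl
    }

    @Override
    protected List<String> getFieldOrder() {
        return Arrays.asList("events", "data");
    }
}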

@bturner (Collaborator Author) commented Jan 12, 2018

We're investigating using NuProcess in Bitbucket Server, so I put together a simple JMH benchmark to get a feel for its performance when pumping a large amount of data from a process (serving Git hosting operations can require pumping an enormous amount of data). My benchmark was very simple:

import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import java.util.zip.CRC32;

import org.junit.Test;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import com.zaxxer.nuprocess.NuAbstractProcessHandler;
import com.zaxxer.nuprocess.NuProcess;
import com.zaxxer.nuprocess.NuProcessBuilder;

import static org.junit.Assert.assertEquals;

public class NuProcessBenchmark {

    public static void main(String[] args) throws Exception {
        System.setProperty("com.zaxxer.nuprocess.threads", "cores");

        Options options = new OptionsBuilder()
                .addProfiler(GCProfiler.class)
                //.forks(0) // Set when running with JProfiler
                .forks(2)
                .jvmArgsAppend("-Dcom.zaxxer.nuprocess.threads=cores")
                .include(NuProcessBenchmark.class.getSimpleName())
                //.measurementIterations(20) // Set when running with JProfiler
                .measurementIterations(50)
                //.threads(5) // Set when running with JProfiler
                .threads(10)
                .timeUnit(TimeUnit.SECONDS)
                .warmupIterations(1)
                .build();

        new Runner(options).run();
    }

    @Benchmark
    @Test
    public void cat() throws Exception {
        NuProcessBuilder builder = new NuProcessBuilder("cat", "/tmp/random.dat");
        builder.setProcessListener(new NuAbstractProcessHandler() {

            private final CRC32 crc32 = new CRC32();

            @Override
            public void onExit(int statusCode) {
                // CRC32 of the specific 250MB test file; fails if the pumped output was corrupted
                assertEquals(3407529827L, crc32.getValue());
            }

            @Override
            public void onStdout(ByteBuffer buffer, boolean closed) {
                crc32.update(buffer);
            }
        });

        NuProcess process = builder.start();
        assertEquals(0, process.waitFor(1L, TimeUnit.DAYS));
    }
}

(/tmp/random.dat is a 250MB file created using head -c 250M /dev/urandom >/tmp/random.dat)

When running the benchmark, I noticed that the GC churn appeared to be extremely high, triggering heavy GC. (jmh showed ~34MB/s churn, compared to <200KB/s churn from a basically identical benchmark using our existing processutils library)

Running JProfiler (my benchmark above has comments noting the values I used when profiling) and analyzing allocation hot spots pointed at ProcessEpoll.process, with a graph like this:
[JProfiler allocation hot spot graph for ProcessEpoll.process]

JNA is the primary culprit, triggering well over 500MB in allocations, as well as allocating a fair number of Integers.

With the changes above, profiling the same benchmark now shows:
[JProfiler allocation hot spot graph after the changes]

Notice that:

  • The number of Integer allocations has been massively reduced
  • The number of JNA allocations has been cut approximately in half
  • Pointer allocations have been removed entirely

This reduces the reported churn rate from 34MB/s to 14MB/s, which results in significantly fewer GC runs.

The churn rate is extremely important to Bitbucket Server because we run an enormous number of Git processes and, on larger servers, can easily have over 100 git processes running in parallel. Not all of the processes will be pumping hundreds of megabytes of data, but such processes are anything but rare (and 250MB is far from the largest; repositories that are tens of gigabytes, while not common, are not exactly rare, either).

(Edit: For reference, the benchmark was run on a Linux Mint 18.3 x86_64 VM, with 8 VCPUs (AMD Threadripper 1950X) and 16GB of RAM, using Oracle Java 8u151 and Maven 3.3.9.)

// per bits/epoll.h, epoll_event is created with __attribute__ ((__packed__)), which disables
// applying padding to optimize alignment. epoll_event is memory-aligned on 32-bit platforms,
// but not on 64-bit platforms (i.e. it uses 32-bit alignment on 64-bit platforms)
super(ALIGN_NONE);

bturner (Collaborator Author):

I'm still doing some testing on 32-bit Linux to further verify this change

brettwooldridge (Owner):

Let us know when you've finished 32-bit verification.

bturner (Collaborator Author):

Sorry, it took a little bit of setup (I didn't have a 32-bit VM on hand), but I've just finished testing it now. All of the unit tests pass on x86 Ubuntu 14.04. Once I've added the benchmark I wrote to the project, I'll run it in my 32-bit environment as well for a little more stress testing.

Details for my 32-bit environment:

bturner@ubuntu:~/Development/oss/nuprocess$ uname -a
Linux ubuntu 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:06:37 UTC 2016 i686 athlon i686 GNU/Linux
bturner@ubuntu:~/Development/oss/nuprocess$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i686-linux-gnu/4.8/lto-wrapper
Target: i686-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.4-2ubuntu1~14.04.3' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-i386/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-i386 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-i386 --with-arch-directory=i386 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-targets=all --enable-multiarch --disable-werror --with-arch-32=i686 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=i686-linux-gnu --host=i686-linux-gnu --target=i686-linux-gnu
Thread model: posix
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)

A slightly modified version of my sizeof program from #79 confirms (as expected) that sizeof(struct epoll_event) is 12 on x86 Linux as well.

bturner (Collaborator Author):

I've run my jmh benchmark on my 32-bit VM now. Here's the summary:

# Run complete. Total time: 00:04:07

Benchmark                                                 Mode  Cnt       Score        Error    Units
NuProcessBenchmark.cat                                   thrpt  100      20.496 ±      0.486    ops/s
NuProcessBenchmark.cat:·gc.alloc.rate                    thrpt  100       7.703 ±      0.183   MB/sec
NuProcessBenchmark.cat:·gc.alloc.rate.norm               thrpt  100  540022.366 ±    472.536     B/op
NuProcessBenchmark.cat:·gc.churn.Metaspace               thrpt  100      ≈ 10⁻⁴                MB/sec
NuProcessBenchmark.cat:·gc.churn.Metaspace.norm          thrpt  100       2.786 ±      6.647     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space           thrpt  100       6.662 ±      4.691   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm      thrpt  100  518120.933 ± 367662.475     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Old_Gen              thrpt  100       0.520 ±      0.884   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Old_Gen.norm         thrpt  100   40171.064 ±  67837.749     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space       thrpt  100       0.057 ±      0.075   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm  thrpt  100    4530.189 ±   5891.110     B/op
NuProcessBenchmark.cat:·gc.count                         thrpt  100      29.000                counts
NuProcessBenchmark.cat:·gc.time                          thrpt  100    4980.000                    ms
NuProcessBenchmark.cat:·threads.alive                    thrpt  100      18.460 ±      0.170  threads
NuProcessBenchmark.cat:·threads.daemon                   thrpt  100      17.460 ±      0.170  threads
NuProcessBenchmark.cat:·threads.started                  thrpt  100      19.000               threads

20 ops/sec, given each op is a complete cat of a 250MB test file, translates to a throughput of roughly 5GB/s (using the default 4 pump threads, given my VM has 8 VCPUs). Not bad.

brettwooldridge (Owner):

Not bad. If you need higher throughput, you can experiment with calling fcntl with F_SETPIPE_SZ on the pipe. The Linux default pipe size is 64K, but it can be raised from user space (via fcntl) up to 1MB. There’s probably a sweet spot in there, maybe between 128K and 256K.
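
For anyone trying this, here is a minimal sketch of that fcntl call through JNA. It is not part of NuProcess; the constants are the Linux-specific values from linux/fcntl.h, and the pipe fd would have to be exposed somehow.

import com.sun.jna.Native;

// Sketch only: enlarge a pipe with fcntl(F_SETPIPE_SZ) on Linux.
public final class PipeSizer {
    private static final int F_SETPIPE_SZ = 1031; // F_LINUX_SPECIFIC_BASE (1024) + 7
    private static final int F_GETPIPE_SZ = 1032; // F_LINUX_SPECIFIC_BASE (1024) + 8

    static {
        Native.register("c"); // direct-map fcntl from libc
    }

    public static native int fcntl(int fd, int cmd, int arg);

    // Asks the kernel to grow the pipe behind fd; the kernel rounds the size up as it sees fit.
    public static int setPipeSize(int fd, int size) {
        return fcntl(fd, F_SETPIPE_SZ, size);
    }

    public static int getPipeSize(int fd) {
        return fcntl(fd, F_GETPIPE_SZ, 0);
    }
}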

brettwooldridge (Owner):

Of course you would need to adjust the NuProcess buffer sizes to match.

brettwooldridge (Owner):

Actually, thinking about it, something like “cat” pretty much produces constant output, but a git command’s output is likely to be quite “bursty”. A burst of output larger than 64K will stall the git process until NuProcess can drain the pipe. So setting a larger pipe size might have good payoff there.

bturner (Collaborator Author):

Git's an interesting case, because it doesn't behave quite how one might expect. For hosting operations (which, for obvious reasons, is what I care about the most here) it tends to reach a "steady state" fairly quickly. After the ref advertisement, it generates a pack with the requested objects and from there it's just streaming that pack. The wire protocol uses a packet format, which is bounded to 8K. That means it writes at most a shade over 8K (8197 bytes) at a time.

We support two different transports, HTTP (Tomcat) and SSH (Mina SSHD). Tomcat uses an 8K buffer by default for HttpServletResponse, so we use an 8K buffer when copying streams. For SSH, specifically because of Git's packet sizes, we're using an exact 8197 byte buffer to copy streams. These buffers effectively carry any backpressure from the socket through to the underlying Git process. If the socket can't move the data fast enough, the Git process gets blocked trying to write more output and slows down to match.

Of course, as the pump threads fall behind (which they almost certainly would, since they'd be marshaling over to a socket of some sort), while Git might not be writing more than ~8K per shot, that doesn't mean there can't be multiple packets waiting for the pump thread to read. But since those larger reads are still going to be funneled through a buffer somewhere around 8K on their way to the client, those larger reads may not offer much benefit (and, in fact, doing smaller, more packet-sized reads might allow the pump thread to dump output to more sockets per second, keeping perceived transfer rates more steady).

Setting hosting aside, a lot of what Bitbucket Server spends the rest of its time doing is running git rev-list, git diff-tree and git diff, with a few other commands interwoven here and there. For those commands, it's entirely possible they won't produce 64K of output over their entire execution, let alone in a single shot.

If anything, then, I think what we'd really want is a way to change NuProcess.BUFFER_CAPACITY from a hard constant to something that can, at least to some extent, be changed at runtime. The most flexible option would probably be to make it configurable on the NuProcessBuilder and use NuProcess.BUFFER_CAPACITY as a default, but even making it configurable via system property would let the application do some tuning.
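
To make the suggestion concrete, here is a hypothetical sketch of the system-property fallback; the property name and the builder setter below do not exist in NuProcess today, they are purely illustrative.

// Hypothetical: let applications tune the buffer size while keeping the current 64K default.
public static final int BUFFER_CAPACITY =
        Integer.getInteger("com.zaxxer.nuprocess.bufferCapacity", 65536);

// A per-builder override would be more flexible still (again, purely illustrative):
// NuProcessBuilder builder = new NuProcessBuilder("git", "upload-pack", "--stateless-rpc", "repo.git");
// builder.setBufferCapacity(256 * 1024); // hypothetical method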

// technically this union should have a "Pointer ptr" field, but, for how EpollData is actually
// used, only referencing the "fd" field, it's nothing but overhead. JNA will end up constructing
// them as part of ProcessEpoll's execution, but they never get used
//public Pointer ptr;

bturner (Collaborator Author):

Removing this completely avoids Pointer allocations in ProcessEpoll.process

public static final int EPOLL_CTL_MOD = 3; /* Remove a file decriptor from the interface. */
public static final int EPOLL_CTL_ADD = 1; /* Add a file descriptor to the interface. */
public static final int EPOLL_CTL_DEL = 2; /* Remove a file descriptor from the interface. */
public static final int EPOLL_CTL_MOD = 3; /* Change file descriptor epoll_event structure. */

bturner (Collaborator Author):

The comments on EPOLL_CTL_MOD and EPOLL_CTL_DEL were backwards, and all three had "descriptor" misspelled

@@ -34,21 +34,21 @@

public static native int epoll_ctl(int epfd, int op, int fd, EpollEvent event);

// technically, events should be an array, but we only ever call with maxevents = 1

bturner (Collaborator Author):

Even if we used an array, with direct binding (Native.register) the parameter type would still be EpollEvent, not EpollEvent[]; array parameters are not supported with direct binding.

Instead, to allow maxevents > 1, you'd allocate an EpollEvent[] using Structure.toArray and then pass the first element as events.
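
A sketch of what that could look like, assuming the epoll_wait binding mirrors the native signature (epfd, events, maxevents, timeout) as the current single-event call does; epfd and timeoutMillis stand in for the poller's own fields:

// Sketch only: servicing more than one event per epoll_wait with direct binding.
// Structure.toArray lays all elements out in one contiguous native allocation,
// so passing events[0] hands the kernel the whole array.
EpollEvent[] events = (EpollEvent[]) new EpollEvent().toArray(64);
int triggered = LibEpoll.epoll_wait(epfd, events[0], events.length, timeoutMillis);
for (int i = 0; i < triggered; i++) {
    events[i].read(); // with auto-read disabled, each triggered entry must be pulled explicitly
    // ... dispatch on events[i] ...
}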

eventPool.add(new EpollEvent());
EpollEvent event = new EpollEvent();
// disable auto-read for these events, because they're only used to call epoll_ctl
event.setAutoRead(false);

bturner (Collaborator Author):

This reduces SharedMemory allocations in registerProcess when (needlessly) copying from native memory to heap memory after calling epoll_ctl

deadPool = new LinkedList<LinuxProcess>();
// disable auto-read and auto-write for triggeredEvent; process() will explicitly call read()
// only if epoll_wait returns 1, and only call write() before invoking epoll_ctl
triggeredEvent.setAutoSynch(false);

bturner (Collaborator Author):

This reduces SharedMemory allocations in process:

  • The contents of the structure aren't read by epoll_wait, only written to, so they're irrelevant and a write() is nothing but overhead
  • If epoll_wait returns <1, there's no point in reading the native memory back onto the heap
  • Native memory is explicitly written prior to calling epoll_ctl to register a continued interest in stdin availability, but is not read back after the call

@@ -178,9 +184,9 @@ public void closeStdin(LinuxProcess process)
@Override
public boolean process()
{
Integer stdinFd = null;
Integer stdoutFd = null;
Integer stderrFd = null;

bturner (Collaborator Author):

Using Integer and doing a null check allocates 3 Integers per (very tightly looped) process call. Individually, they don't matter much, but when pumping a large amount of data across a large number of processes they contribute noticeably to GC churn. Using an int with a marker value avoids those allocations on many process calls, but not all of them (see the sketch after this list):

  • ident is still boxed to call fildesToProcessMap.get
  • All three descriptors are boxed if the process becomes eligible for soft-exit cleanup
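
A sketch of the marker-value pattern (variable names mirror the diff context above; the surrounding logic is elided):

// Marker value instead of null: no boxing on the common path.
int stdinFd = Integer.MIN_VALUE;
int stdoutFd = Integer.MIN_VALUE;
int stderrFd = Integer.MIN_VALUE;
// ... acquire the descriptors from the process being serviced ...
if (stdinFd != Integer.MIN_VALUE) {   // previously "stdinFd != null", which boxed three Integers per call
    // ...
}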

linuxProcess.getStderr().release();
}
}
triggeredEvent.clear();

bturner (Collaborator Author):

This is a "cleaner" approach, but the contents of triggeredEvent at the end of the call don't matter. It's guaranteed the next call will be to epoll_wait, which will not read those contents; it'll only overwrite them if a descriptor is ready.

checkDeadPool();
}
}

// ************************************************************************
// Private methods
// ************************************************************************
AtomicInteger count = new AtomicInteger();

bturner (Collaborator Author):

Unused

@bturner (Collaborator Author) commented Jan 12, 2018

I'm not sure about the Travis failure. That doesn't look like anything I broke.

Feel free to ignore my .gitignore and pom.xml changes. I tried not to adjust whitespace or other formatting, to keep my changes down and make them easier to review.

@bturner (Collaborator Author) commented Jan 12, 2018

(Also, I guess I should note that this will fix #79)

@lfbayer (Collaborator) commented Jan 12, 2018

I love these changes, Thanks!

It looks like travis-ci no longer supports "oraclejdk7". Would you mind changing the .travis.yml to point at "openjdk7" instead so that we can get the build green?

Everything looks like it's fine on my local box here.

@bturner (Collaborator Author) commented Jan 12, 2018

@lfbayer Thanks for the speedy review! Happy to tweak the Travis config

@bturner (Collaborator Author) commented Jan 12, 2018

Looks like CatTest.changeCwd is failing in Travis, but I'm not sure why. Platform, maybe? I tried Oracle Java 7u80 and 8u151 and OpenJDK 8u151 and all tests passed on all three.

@lfbayer (Collaborator) commented Jan 12, 2018

It definitely has nothing to do with any of your changes.

I think error: 1 would indicate EPERM, meaning there is some permission error.

The internets indicate that travis-ci does not give CAP_SYS_ADMIN, which would mean that this call would fail. But it isn't clear to me how the build ever succeeded if that is the case. I'm guessing now that it must be something that changed in travis-ci. But it's strange to me that we even require additional permissions. Unfortunately I'm not really familiar with unshare or CLONE_FS, maybe we're actually using it wrong.

It looks like @brettwooldridge is committing some changes that he was sitting on. I've chatted briefly about this issue with him and there are a couple things we will investigate.

}
else if ((events & LibEpoll.EPOLLOUT) != 0) // Room in stdin pipe available to write
{
} else if ((events & LibEpoll.EPOLLOUT) != 0) { // Room in stdin pipe available to write

brettwooldridge (Owner):

NuProcess code-style is close-curly on a line by itself.

bturner (Collaborator Author):

I thought I'd reverted all the places I changed this. Will fix.

@brettwooldridge (Owner):

If you rebase onto master your travis-ci build should work now (you're likely to have a conflict in .travis.yml).

@brettwooldridge (Owner):

@bturner Finally, can you create a "benchmarks" directory and include your jmh benchmark?

@bturner (Collaborator Author) commented Jan 12, 2018

@brettwooldridge Sure, happy to add that in. Would you like it in a Maven profile to keep it out of the main build?

@bturner mentioned this pull request Jan 12, 2018
@brettwooldridge (Owner):

@bturner If you have no further changes, I am prepared to merge this.

@bturner (Collaborator Author) commented Jan 12, 2018

Sorry, had to step away to help put my son to bed. I'm just finishing up adding the benchmark. I can open a separate pull request for it if you prefer.

@brettwooldridge (Owner):

@bturner No it's fine, no hurry, just wait until the pull request is ready.

- NuProcessBenchmark takes a given test file and cats it to stdout,
  computing a CRC32 checksum to test for corruption
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.7.0</version>

bturner (Collaborator Author):

I had updated plugins in the main pom.xml as well, but to try and keep this change more focused I reverted those updates. I'm happy to push them separately, if you'd like.

Note that I need at least version 3.5 to use annotationProcessorPaths in the configuration.

* That will display the CRC32 checksum in hex, which can then be converted to decimal to match {@code CRC32}'s
* {@code getValue()}. (The calculator on most operating systems can convert hex to decimal.)
*/
public class NuProcessBenchmark {

bturner (Collaborator Author):

I wish that there was a way to make this a bit more self-contained. I had thought about just having it directly fork head -c 250MB /dev/urandom, but that relies on the system's ability to generate sufficient entropy. Otherwise it gets extremely slow, which would skew the benchmark. (It also makes it impossible to perform any sort of quick validation on the output to test for corruption, since the checksum would continuously change)

@@ -43,7 +44,7 @@
<name>Rich DiCroce</name>
</developer>
<developer>
<name>Chris Cladden</name>
<name>Chris Cladden</name>

bturner (Collaborator Author):

Just fixed some whitespace here

*
* @param force if <code>true</code> is passed, STDIN will be immediately closed
* even if there are pending writes; otherwise, it will be closed
* after all pending writes are completed

bturner (Collaborator Author):

I noticed a Javadoc warning about this being missing

@bturner (Collaborator Author) commented Jan 12, 2018

Sorry for the delay, @brettwooldridge and @lfbayer . I got hung up on some details that I couldn't get quite how I wanted them.

My benchmark requires Java 8 because it uses CRC32.update(ByteBuffer), which is new in 8. Other approaches skew the benchmark because they require allocating byte[]s to copy everything out of the ByteBuffer.

MessageDigest has had update(ByteBuffer) since Java 5, but all of the digest algorithms are quite slow compared to a simple CRC32, and skew the benchmark as a result. I don't need a cryptographically strong hash; I'm just using the CRC32 a) as something to "do" with the output and b) to perform a low-overhead check that the output was pumped "correctly".
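
To illustrate the difference inside the handler (a snippet of the benchmark's handler above; the pre-Java-8 fallback shown in the comment is what would skew the numbers):

private final CRC32 crc32 = new CRC32();

@Override
public void onStdout(ByteBuffer buffer, boolean closed) {
    crc32.update(buffer); // Java 8+: consumes the buffer directly, no intermediate copy

    // Pre-Java 8 equivalent, which allocates a fresh byte[] on every callback:
    // byte[] copy = new byte[buffer.remaining()];
    // buffer.get(copy);
    // crc32.update(copy, 0, copy.length);
}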

@brettwooldridge (Owner):

It’s fine if the benchmark requires Java8.

@brettwooldridge merged commit 4989a9e into brettwooldridge:master Jan 13, 2018
@brettwooldridge (Owner) commented Jan 14, 2018

@bturner Checkout these memory results after my latest refactor:

Iteration   4: 21.312 ops/s
                 ·gc.alloc.rate:      0.656 MB/sec
                 ·gc.alloc.rate.norm: 52814.261 B/op
                 ·gc.count:           ≈ 0 counts
                 ·threads.alive:      11.000 threads
                 ·threads.daemon:     10.000 threads
                 ·threads.started:    11.000 threads
...
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate":
  0.610 ±(99.9%) 0.127 MB/sec [Average]
  (min, avg, max) = (0.411, 0.610, 0.834), stdev = 0.124
  CI (99.9%): [0.484, 0.737] (assumes normal distribution)

Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate.norm":
  54039.498 ±(99.9%) 2071.499 B/op [Average]
  (min, avg, max) = (52798.857, 54039.498, 59384.800), stdev = 2034.489
  CI (99.9%): [51967.999, 56110.998] (assumes normal distribution)

Compared to previous (on my Linux VM):

Iteration   4: 14.460 ops/s
                 ·gc.alloc.rate:      6.760 MB/sec
                 ·gc.alloc.rate.norm: 693246.857 B/op
                 ·gc.count:           ≈ 0 counts
                 ·threads.alive:      11.000 threads
                 ·threads.daemon:     10.000 threads
                 ·threads.started:    11.000 threads
...
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate":
  7.175 ±(99.9%) 1.383 MB/sec [Average]
  (min, avg, max) = (3.776, 7.175, 9.336), stdev = 1.358
  CI (99.9%): [5.793, 8.558] (assumes normal distribution)

Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate.norm":
  694515.796 ±(99.9%) 2264.247 B/op [Average]
  (min, avg, max) = (693200.000, 694515.796, 699813.600), stdev = 2223.793
  CI (99.9%): [692251.549, 696780.043] (assumes normal distribution)

I profiled a jmh run with JProfiler (./benchmark.sh profile) and, as far as I can tell, 99% of the allocation was related to the benchmark itself. NuProcess itself, at least on Linux where these changes reside, should be running very close to zero-allocation, with the exception of process startup, where things like the environment are cloned and the 3 I/O buffers are created.

@bturner I'd be interested to hear how this affects your throughput numbers.

@brettwooldridge (Owner):

@bturner Also fixed up the benchmark to self-calculate the CRC32.

@brettwooldridge (Owner):

@bturner The moral of the story seems to be that JNA is extremely inefficient by default, but can be efficient if used carefully.

Structures are extremely inefficient, with every field access throwing off Iterator instances.

And (possibly the biggest impact): passing a ByteBuffer that was allocated by ByteBuffer.allocateDirect() across the JNA interface, compared to one from new Memory(xxx).getByteBuffer(), causes allocations of Memory$SharedMemory objects on every trip.
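
A small sketch of the contrast (buffer size illustrative); both produce direct buffers, but only the first reuses JNA's own native allocation:

Memory memory = new Memory(65536);
ByteBuffer jnaBacked = memory.getByteBuffer(0, memory.size()); // backed by JNA-owned memory
ByteBuffer jdkDirect = ByteBuffer.allocateDirect(65536);       // per the above, crossing JNA with this allocates Memory$SharedMemory on every trip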

@bturner (Collaborator Author) commented Jan 16, 2018

That's an epic improvement, @brettwooldridge! Sensational. My benchmark run now shows:

# Run complete. Total time: 00:03:51

Benchmark                                                 Mode  Cnt       Score        Error    Units
NuProcessBenchmark.cat                                   thrpt  100      27.591 ±      0.678    ops/s
NuProcessBenchmark.cat:·gc.alloc.rate                    thrpt  100       4.168 ±      0.098   MB/sec
NuProcessBenchmark.cat:·gc.alloc.rate.norm               thrpt  100  231161.994 ±    387.435     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space           thrpt  100       4.503 ±      3.528   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm      thrpt  100  254029.703 ± 198985.617     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space       thrpt  100       0.025 ±      0.048   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm  thrpt  100    1382.581 ±   2662.888     B/op
NuProcessBenchmark.cat:·gc.count                         thrpt  100      16.000                counts
NuProcessBenchmark.cat:·gc.time                          thrpt  100     194.000                    ms
NuProcessBenchmark.cat:·threads.alive                    thrpt  100      23.510 ±      0.170  threads
NuProcessBenchmark.cat:·threads.daemon                   thrpt  100      22.510 ±      0.170  threads
NuProcessBenchmark.cat:·threads.started                  thrpt  100      24.000               threads

(Note that while the name is the same this isn't quite the same benchmark as the one in the repository. For consistency I need to keep using my existing benchmark, otherwise this result wouldn't really be a relevant comparison for me.)

The throughput shows a marginal improvement of about 2 ops/s, big enough to be statistically relevant, but the real win is definitely that new allocation rate.

As a potentially relevant aside, this is the result of a similar (identical within my ability to make it so) benchmark run against jnr-process:

# Run complete. Total time: 00:03:39

Benchmark                                                  Mode  Cnt       Score        Error    Units
JnrProcessBenchmark.cat                                   thrpt  100      28.234 ±      0.222    ops/s
JnrProcessBenchmark.cat:·gc.alloc.rate                    thrpt  100       4.388 ±      0.046   MB/sec
JnrProcessBenchmark.cat:·gc.alloc.rate.norm               thrpt  100  229441.708 ±    515.061     B/op
JnrProcessBenchmark.cat:·gc.churn.PS_Eden_Space           thrpt  100       4.554 ±      3.559   MB/sec
JnrProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm      thrpt  100  239207.232 ± 187075.828     B/op
JnrProcessBenchmark.cat:·gc.churn.PS_Survivor_Space       thrpt  100       0.072 ±      0.109   MB/sec
JnrProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm  thrpt  100    3699.963 ±   5587.508     B/op
JnrProcessBenchmark.cat:·gc.count                         thrpt  100      16.000                counts
JnrProcessBenchmark.cat:·gc.time                          thrpt  100     103.000                    ms
JnrProcessBenchmark.cat:·threads.alive                    thrpt  100      16.980 ±      0.048  threads
JnrProcessBenchmark.cat:·threads.daemon                   thrpt  100      15.980 ±      0.048  threads
JnrProcessBenchmark.cat:·threads.started                  thrpt  100      17.000               threads

The allocation rates between the two are really close now, suggesting a transition over to JNR might not really have much "win" to it, from that perspective, now that NuProcess's JNA approach has been so aggressively tuned.

@bturner deleted the optimize-epoll-processing branch January 16, 2018 20:56
@brettwooldridge (Owner) commented Jan 17, 2018

@bturner Thanks for the feedback! Very encouraging.

As I think about the allocation rate, I thought about pooling. Currently, when a process is started and a new instance of BasePosixProcess (LinuxProcess) is created, I allocate the three buffers for stdin/stdout/stderr. These are 64K each. When the process ends, these become garbage to be collected. At a rate of 20+ processes per second, this results in a several MB/sec allocation rate.

NuProcess could likely benefit from a pooling scheme for these buffers, whereby when the process ends, the buffers are returned to a pool, to be picked up by the next process run.

A pool could be associated with each processing thread, which would have two benefits. First, because the pool would only be accessed by a single thread, it could be non-concurrent/lock-free. Second, processing threads have a finite lifetime, so if no processes are run for 2.5s (default) the thread terminates and the pool would go with it, allowing memory to be given back during idle periods.

During busy periods, this would result in a near zero allocation rate, except for a handful of minor objects; the buffers clearly being the bulk of the memory overhead of a process.
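
A rough sketch of that per-thread pool (hypothetical; nothing like this exists in NuProcess yet, and the names and sizes are illustrative):

import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Owned by a single processing thread, so no synchronization is needed; when the thread
// times out after its idle period, the pool becomes garbage along with it.
final class PumpThreadBufferPool {
    private static final int BUFFER_SIZE = 65536; // matches the current 64K stdin/stdout/stderr buffers
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

    ByteBuffer acquire() {
        ByteBuffer buffer = free.pollFirst();
        return buffer != null ? buffer : ByteBuffer.allocateDirect(BUFFER_SIZE);
    }

    void release(ByteBuffer buffer) {
        buffer.clear();
        free.addFirst(buffer); // LIFO keeps the most recently used buffer on top
    }
}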

@brettwooldridge (Owner):

I burned some midnight oil and implemented pooling of Memory/ByteBuffer objects. Much to my astonishment, the benchmark showed almost no difference at all in terms of allocation rate.

Either JNA or the JVM itself is playing a game behind the scenes to reuse the objects in the non-pooling case.

I’d like to measure one of the non-jmh tests with some of the heap analysis tools to eliminate jmh's own allocation from the picture.
