
Optimize epoll_wait loop to avoid memory churn. #80

Merged

Conversation

@bturner (Collaborator) commented Jan 12, 2018

  • Disabled autoRead and/or autoWrite on EpollEvent structs in ProcessEpoll, depending on how they're used (see the sketch at the end of this description)
    • EpollEvents in the eventPool have autoRead disabled because they are used in a write-only context (calling epoll_ctl)
    • The triggeredEvent EpollEvent has both autoRead and autoWrite disabled
      • When epoll_wait returns 1, read() is called explicitly; if it returns 0, native memory is not read
      • Prior to calling epoll_ctl to register for further stdin events, write() is called explicitly
  • Fixed memory alignment on EpollEvent and added a unit test to verify the size of the struct is 12 bytes
  • Removed EpollEvent.EpollData.ptr because, even though it's never used, JNA still allocates it
  • Replaced Integers with ints, using Integer.MIN_VALUE as a marker when acquiring descriptors, to avoid allocating 3 per process() loop
    • If the process soft-exits, Integers will still be created then

Extra bits:

  • Added IntelliJ IDEA files to .gitignore
  • Updated JNA to 4.5.1
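
For reference, here is a minimal sketch of the JNA mechanics these bullets rely on. The class and field names are illustrative (the real EpollEvent lives in com.zaxxer.nuprocess.linux and nests an epoll_data union); the point is the ALIGN_NONE packing plus explicit read()/write() calls replacing JNA's per-call auto-synch.

import java.util.Arrays;
import java.util.List;

import com.sun.jna.Structure;

// Sketch of a packed, explicitly-synched JNA struct. By default JNA copies the whole Structure
// between Java and native memory around every native call; turning that off and calling
// read()/write() by hand is what removes the churn described above.
public class PackedEpollEventSketch extends Structure {
    public int events;  // EPOLLIN, EPOLLOUT, ...
    public long data;   // stands in for the 8-byte epoll_data union; only its fd member is used

    public PackedEpollEventSketch() {
        super(ALIGN_NONE);   // struct epoll_event is __packed__ on Linux: 12 bytes, no padding
        setAutoRead(false);  // skip native -> Java copies; call read() only when epoll_wait returns 1
        setAutoWrite(false); // skip Java -> native copies; call write() just before epoll_ctl
    }

    @Override
    protected List<String> getFieldOrder() {
        return Arrays.asList("events", "data");
    }
}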

@bturner (Collaborator Author) commented Jan 12, 2018

We're investigating using NuProcess in Bitbucket Server, so I put together a simple JMH benchmark to get a feel for its performance when pumping a large amount of data from a process (serving Git hosting operations can require pumping an enormous amount of data). My benchmark was very simple:

import java.nio.ByteBuffer;
import java.util.concurrent.TimeUnit;
import java.util.zip.CRC32;

import org.junit.Test;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.profile.GCProfiler;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import com.zaxxer.nuprocess.NuAbstractProcessHandler;
import com.zaxxer.nuprocess.NuProcess;
import com.zaxxer.nuprocess.NuProcessBuilder;

import static org.junit.Assert.assertEquals;

public class NuProcessBenchmark {

    public static void main(String[] args) throws Exception {
        System.setProperty("com.zaxxer.nuprocess.threads", "cores");

        Options options = new OptionsBuilder()
                .addProfiler(GCProfiler.class)
                //.forks(0) // Set when running with JProfiler
                .forks(2)
                .jvmArgsAppend("-Dcom.zaxxer.nuprocess.threads=cores")
                .include(NuProcessBenchmark.class.getSimpleName())
                //.measurementIterations(20) // Set when running with JProfiler
                .measurementIterations(50)
                //.threads(5) // Set when running with JProfiler
                .threads(10)
                .timeUnit(TimeUnit.SECONDS)
                .warmupIterations(1)
                .build();

        new Runner(options).run();
    }

    @Benchmark
    @Test
    public void cat() throws Exception {
        NuProcessBuilder builder = new NuProcessBuilder("cat", "/tmp/random.dat");
        builder.setProcessListener(new NuAbstractProcessHandler() {

            private final CRC32 crc32 = new CRC32();

            @Override
            public void onExit(int statusCode) {
                // CRC32 of the specific 250MB test file; fails if the pumped output was corrupted
                assertEquals(3407529827L, crc32.getValue());
            }

            @Override
            public void onStdout(ByteBuffer buffer, boolean closed) {
                crc32.update(buffer);
            }
        });

        NuProcess process = builder.start();
        assertEquals(0, process.waitFor(1L, TimeUnit.DAYS));
    }
}

(/tmp/random.dat is a 250MB file created using head -c 250M /dev/urandom >/tmp/random.dat)

When running the benchmark, I noticed that the GC churn appeared to be extremely high, triggering heavy GC. (jmh showed ~34MB/s churn, compared to <200KB/s churn from a basically identical benchmark using our existing processutils library)

Running JProfiler (my benchmark above has comments noting the values I used when profiling) and analyzing allocation hot spots pointed at ProcessEpoll.process, with a graph like this:
[JProfiler allocation hot spot graph for ProcessEpoll.process]

JNA is the primary culprit, triggering well over 500MB in allocations, as well as allocating a fair number of Integers.

With the changes above, profiling the same benchmark now shows:
[JProfiler allocation hot spot graph after the changes]

Notice that:

  • The number of Integer allocations has been massively reduced
  • The number of JNA allocations has been cut approximately in half
  • Pointer allocations have been removed entirely

This reduces the reported churn rate from 34MB/s to 14MB/s, which results in significantly fewer GC runs.

The churn rate is extremely important to Bitbucket Server because we run an enormous number of Git processes and, on larger servers, can easily have over 100 git processes running in parallel. Not all of the processes will be pumping hundreds of megabytes of data, but such processes are anything but rare (and 250MB is far from the largest; repositories that are tens of gigabytes, while not common, are not exactly rare, either).

(Edit: For reference, the benchmark was run on a Linux Mint 18.3 x86_64 VM, with 8 VCPUs (AMD Threadripper 1950X) and 16GB of RAM, using Oracle Java 8u151 and Maven 3.3.9.)

// per bits/epoll.h, epoll_event is created with __attribute__ ((__packed__)), which disables
// applying padding to optimize alignment. epoll_event is memory-aligned on 32-bit platforms,
// but not on 64-bit platforms (i.e. it uses 32-bit alignment on 64-bit platforms)
super(ALIGN_NONE);

bturner (Collaborator Author):

I'm still doing some testing on 32-bit Linux to further verify this change

brettwooldridge (Owner):

Let us know when you've finished 32-bit verification.

bturner (Collaborator Author):

Sorry, it took a little bit of setup (I didn't have a 32-bit VM on hand), but I've just finished testing it now. All of the unit tests pass on x86 Ubuntu 14.04. Once I've added the benchmark I wrote to the project, I'll run it in my 32-bit environment as well for a little more stress testing.

Details for my 32-bit environment:

bturner@ubuntu:~/Development/oss/nuprocess$ uname -a
Linux ubuntu 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:06:37 UTC 2016 i686 athlon i686 GNU/Linux
bturner@ubuntu:~/Development/oss/nuprocess$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/i686-linux-gnu/4.8/lto-wrapper
Target: i686-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.8.4-2ubuntu1~14.04.3' --with-bugurl=file:///usr/share/doc/gcc-4.8/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.8 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.8 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-libmudflap --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.8-i386/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.8-i386 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.8-i386 --with-arch-directory=i386 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-targets=all --enable-multiarch --disable-werror --with-arch-32=i686 --with-multilib-list=m32,m64,mx32 --with-tune=generic --enable-checking=release --build=i686-linux-gnu --host=i686-linux-gnu --target=i686-linux-gnu
Thread model: posix
gcc version 4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04.3)

A slightly modified version of my sizeof program from #79 confirms (as expected) that sizeof(struct epoll_event) is 12 on x86 Linux as well.

bturner (Collaborator Author):

I've run my jmh benchmark on my 32-bit VM now. Here's the summary:

# Run complete. Total time: 00:04:07

Benchmark                                                 Mode  Cnt       Score        Error    Units
NuProcessBenchmark.cat                                   thrpt  100      20.496 ±      0.486    ops/s
NuProcessBenchmark.cat:·gc.alloc.rate                    thrpt  100       7.703 ±      0.183   MB/sec
NuProcessBenchmark.cat:·gc.alloc.rate.norm               thrpt  100  540022.366 ±    472.536     B/op
NuProcessBenchmark.cat:·gc.churn.Metaspace               thrpt  100      ≈ 10⁻⁴                MB/sec
NuProcessBenchmark.cat:·gc.churn.Metaspace.norm          thrpt  100       2.786 ±      6.647     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space           thrpt  100       6.662 ±      4.691   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm      thrpt  100  518120.933 ± 367662.475     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Old_Gen              thrpt  100       0.520 ±      0.884   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Old_Gen.norm         thrpt  100   40171.064 ±  67837.749     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space       thrpt  100       0.057 ±      0.075   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm  thrpt  100    4530.189 ±   5891.110     B/op
NuProcessBenchmark.cat:·gc.count                         thrpt  100      29.000                counts
NuProcessBenchmark.cat:·gc.time                          thrpt  100    4980.000                    ms
NuProcessBenchmark.cat:·threads.alive                    thrpt  100      18.460 ±      0.170  threads
NuProcessBenchmark.cat:·threads.daemon                   thrpt  100      17.460 ±      0.170  threads
NuProcessBenchmark.cat:·threads.started                  thrpt  100      19.000               threads

20 ops/sec, given each op is a complete cat of a 250MB test file, translates to a throughput of roughly 5GB/s (using the default 4 pump threads, given my VM has 8 VCPUs). Not bad.

brettwooldridge (Owner):

Not bad. If you need higher throughput, you can experiment with calling fcntl with F_SETPIPE_SZ on the pipe. The Linux default pipe size is 64K, but it can be raised from user space (via fcntl) up to 1MB. There’s probably a sweet spot in there, maybe between 128K and 256K.
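
For anyone trying this, here is a minimal sketch of that fcntl call through JNA. It is not part of NuProcess; the constants are the Linux-specific values from linux/fcntl.h, and the pipe fd would have to be exposed somehow.

import com.sun.jna.Native;

// Sketch only: enlarge a pipe with fcntl(F_SETPIPE_SZ) on Linux.
public final class PipeSizer {
    private static final int F_SETPIPE_SZ = 1031; // F_LINUX_SPECIFIC_BASE (1024) + 7
    private static final int F_GETPIPE_SZ = 1032; // F_LINUX_SPECIFIC_BASE (1024) + 8

    static {
        Native.register("c"); // direct-map fcntl from libc
    }

    public static native int fcntl(int fd, int cmd, int arg);

    // Asks the kernel to grow the pipe behind fd; the kernel rounds the size up as it sees fit.
    public static int setPipeSize(int fd, int size) {
        return fcntl(fd, F_SETPIPE_SZ, size);
    }

    public static int getPipeSize(int fd) {
        return fcntl(fd, F_GETPIPE_SZ, 0);
    }
}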

brettwooldridge (Owner):

Of course you would need to adjust the NuProcess buffer sizes to match.

brettwooldridge (Owner):

Actually, thinking about it, something like “cat” pretty much produces constant output, but a git command’s output is likely to be quite “bursty”. A burst of output larger than 64K will stall the git process until NuProcess can drain the pipe. So setting a larger pipe size might have good payoff there.

bturner (Collaborator Author):

Git's an interesting case, because it doesn't behave quite how one might expect. For hosting operations (which, for obvious reasons, is what I care about the most here) it tends to reach a "steady state" fairly quickly. After the ref advertisement, it generates a pack with the requested objects and from there it's just streaming that pack. The wire protocol uses a packet format, which is bounded to 8K. That means it writes at most a shade over 8K (8197 bytes) at a time.

We support two different transports, HTTP (Tomcat) and SSH (Mina SSHD). Tomcat uses an 8K buffer by default for HttpServletResponse, so we use an 8K buffer when copying streams. For SSH, specifically because of Git's packet sizes, we're using an exact 8197 byte buffer to copy streams. These buffers effectively carry any backpressure from the socket through to the underlying Git process. If the socket can't move the data fast enough, the Git process gets blocked trying to write more output and slows down to match.

Of course, as the pump threads fall behind (which they almost certainly would, since they'd be marshaling over to a socket of some sort), while Git might not be writing more than ~8K per shot, that doesn't mean there can't be multiple packets waiting for the pump thread to read. But since those larger reads are still going to be funneled through a buffer somewhere around 8K on their way to the client, those larger reads may not offer much benefit (and, in fact, doing smaller, more packet-sized reads might allow the pump thread to dump output to more sockets per second, keeping perceived transfer rates more steady).

Setting hosting aside, a lot of what Bitbucket Server spends the rest of its time doing is running git rev-list, git diff-tree and git diff, with a few other commands interwoven here and there. For those commands, it's entirely possible they won't produce 64K of output over their entire execution, let alone in a single shot.

If anything, then, I think what we'd really want is a way to change NuProcess.BUFFER_CAPACITY from a hard constant to something that can, at least to some extent, be changed at runtime. The most flexible option would probably be to make it configurable on the NuProcessBuilder and use NuProcess.BUFFER_CAPACITY as a default, but even making it configurable via system property would let the application do some tuning.
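
To make the suggestion concrete, here is a hypothetical sketch of the system-property fallback; the property name and the builder setter below do not exist in NuProcess today, they are purely illustrative.

// Hypothetical: let applications tune the buffer size while keeping the current 64K default.
public static final int BUFFER_CAPACITY =
        Integer.getInteger("com.zaxxer.nuprocess.bufferCapacity", 65536);

// A per-builder override would be more flexible still (again, purely illustrative):
// NuProcessBuilder builder = new NuProcessBuilder("git", "upload-pack", "--stateless-rpc", "repo.git");
// builder.setBufferCapacity(256 * 1024); // hypothetical method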

// technically this union should have a "Pointer ptr" field, but, for how EpollData is actually
// used, only referencing the "fd" field, it's nothing but overhead. JNA will end up constructing
// them as part of ProcessEpoll's execution, but they never get used
//public Pointer ptr;

bturner (Collaborator Author):

Removing this completely avoids Pointer allocations in ProcessEpoll.process

public static final int EPOLL_CTL_MOD = 3; /* Remove a file decriptor from the interface. */
public static final int EPOLL_CTL_ADD = 1; /* Add a file descriptor to the interface. */
public static final int EPOLL_CTL_DEL = 2; /* Remove a file descriptor from the interface. */
public static final int EPOLL_CTL_MOD = 3; /* Change file descriptor epoll_event structure. */

bturner (Collaborator Author):

The comments on EPOLL_CTL_MOD and EPOLL_CTL_DEL were backwards, and all three had "descriptor" misspelled

@@ -34,21 +34,21 @@

public static native int epoll_ctl(int epfd, int op, int fd, EpollEvent event);

// technically, events should be an array, but we only ever call with maxevents = 1

bturner (Collaborator Author):

Even if we used an array, with direct binding (Native.register) the parameter type would still be EpollEvent, not EpollEvent[]; array parameters are not supported with direct binding.

Instead, to allow maxevents > 1, you'd allocate an EpollEvent[] using Structure.toArray and then pass the first element as events.
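
A sketch of what that could look like, assuming the epoll_wait binding mirrors the native signature (epfd, events, maxevents, timeout) as the current single-event call does; epfd and timeoutMillis stand in for the poller's own fields:

// Sketch only: servicing more than one event per epoll_wait with direct binding.
// Structure.toArray lays all elements out in one contiguous native allocation,
// so passing events[0] hands the kernel the whole array.
EpollEvent[] events = (EpollEvent[]) new EpollEvent().toArray(64);
int triggered = LibEpoll.epoll_wait(epfd, events[0], events.length, timeoutMillis);
for (int i = 0; i < triggered; i++) {
    events[i].read(); // with auto-read disabled, each triggered entry must be pulled explicitly
    // ... dispatch on events[i] ...
}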

eventPool.add(new EpollEvent());
EpollEvent event = new EpollEvent();
// disable auto-read for these events, because they're only used to call epoll_ctl
event.setAutoRead(false);

bturner (Collaborator Author):

This reduces SharedMemory allocations in registerProcess when (needlessly) copying from native memory to heap memory after calling epoll_ctl

deadPool = new LinkedList<LinuxProcess>();
// disable auto-read and auto-write for triggeredEvent; process() will explicitly call read()
// only if epoll_wait returns 1, and only call write() before invoking epoll_ctl
triggeredEvent.setAutoSynch(false);

bturner (Collaborator Author):

This reduces SharedMemory allocations in process:

  • The contents of the structure aren't read by epoll_wait, only written to, so they're irrelevant and a write() is nothing but overhead
  • If epoll_wait returns <1, there's no point in reading the native memory back onto the heap
  • Native memory is explicitly written prior to calling epoll_ctl to register a continued interest in stdin availability, but is not read back after the call

@@ -178,9 +184,9 @@ public void closeStdin(LinuxProcess process)
@Override
public boolean process()
{
Integer stdinFd = null;
Integer stdoutFd = null;
Integer stderrFd = null;

bturner (Collaborator Author):

Using Integer and doing a null check allocates 3 Integers per (very tightly looped) process call. Individually, they don't matter much, but when pumping a large amount of data across a large number of processes they contribute noticeably to GC churn. Using an int with a marker value avoids those allocations on many process calls, but not all of them (see the sketch after this list):

  • ident is still boxed to call fildesToProcessMap.get
  • All three descriptors are boxed if the process becomes eligible for soft-exit cleanup
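
A sketch of the marker-value pattern (variable names mirror the diff context above; the surrounding logic is elided):

// Marker value instead of null: no boxing on the common path.
int stdinFd = Integer.MIN_VALUE;
int stdoutFd = Integer.MIN_VALUE;
int stderrFd = Integer.MIN_VALUE;
// ... acquire the descriptors from the process being serviced ...
if (stdinFd != Integer.MIN_VALUE) {   // previously "stdinFd != null", which boxed three Integers per call
    // ...
}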

linuxProcess.getStderr().release();
}
}
triggeredEvent.clear();

bturner (Collaborator Author):

This is a "cleaner" approach, but the contents of triggeredEvent at the end of the call don't matter. It's guaranteed the next call will be to epoll_wait, which will not read those contents; it'll only overwrite them if a descriptor is ready.

checkDeadPool();
}
}

// ************************************************************************
// Private methods
// ************************************************************************
AtomicInteger count = new AtomicInteger();

bturner (Collaborator Author):

Unused

@bturner (Collaborator Author) commented Jan 12, 2018

I'm not sure about the Travis failure. That doesn't look like anything I broke.

Feel free to ignore my .gitignore and pom.xml changes. I tried not to adjust whitespace or other formatting, to keep my changes down and make them easier to review.

@bturner (Collaborator Author) commented Jan 12, 2018

(Also, I guess I should note that this will fix #79)

@lfbayer (Collaborator) commented Jan 12, 2018

I love these changes, Thanks!

It looks like travis-ci no longer supports "oraclejdk7". Would you mind changing the .travis.yml to point at "openjdk7" instead so that we can get the build green?

Everything looks like it's fine on my local box here.

@bturner (Collaborator Author) commented Jan 12, 2018

@lfbayer Thanks for the speedy review! Happy to tweak the Travis config

@bturner (Collaborator Author) commented Jan 12, 2018

Looks like CatTest.changeCwd is failing in Travis, but I'm not sure why. Platform, maybe? I tried Oracle Java 7u80 and 8u151 and OpenJDK 8u151 and all tests passed on all three.

@lfbayer (Collaborator) commented Jan 12, 2018

It definitely has nothing to do with any of your changes.

I think error: 1 would indicate EPERM, meaning there is some permission error.

The internets indicate that travis-ci does not give CAP_SYS_ADMIN, which would mean that this call would fail. But it isn't clear to me how the build ever succeeded if that is the case. I'm guessing now that it must be something that changed in travis-ci. But it's strange to me that we even require additional permissions. Unfortunately I'm not really familiar with unshare or CLONE_FS, maybe we're actually using it wrong.

It looks like @brettwooldridge is committing some changes that he was sitting on. I've chatted briefly about this issue with him and there are a couple things we will investigate.

}
else if ((events & LibEpoll.EPOLLOUT) != 0) // Room in stdin pipe available to write
{
} else if ((events & LibEpoll.EPOLLOUT) != 0) { // Room in stdin pipe available to write

brettwooldridge (Owner):

NuProcess code-style is close-curly on a line by itself.

bturner (Collaborator Author):

I thought I'd reverted all the places I changed this. Will fix.

@brettwooldridge (Owner):

If you rebase onto master your travis-ci build should work now (you're likely to have a conflict in .travis.yml).

@brettwooldridge (Owner):

@bturner Finally, can you create a "benchmarks" directory and include your jmh benchmark?

@bturner (Collaborator Author) commented Jan 12, 2018

@brettwooldridge Sure, happy to add that in. Would you like it in a Maven profile to keep it out of the main build?

@bturner mentioned this pull request Jan 12, 2018
@brettwooldridge (Owner):

@bturner If you have no further changes, I am prepared to merge this.

@bturner (Collaborator Author) commented Jan 12, 2018

Sorry, had to step away to help put my son to bed. I'm just finishing up adding the benchmark. I can open a separate pull request for it if you prefer.

@brettwooldridge (Owner):

@bturner No it's fine, no hurry, just wait until the pull request is ready.

- NuProcessBenchmark takes a given test file and cats it to stdout,
  computing a CRC32 checksum to test for corruption
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.7.0</version>

bturner (Collaborator Author):

I had updated plugins in the main pom.xml as well, but to try and keep this change more focused I reverted those updates. I'm happy to push them separately, if you'd like.

Note that I need at least version 3.5 to use annotationProcessorPaths in the configuration.

* That will display the CRC32 checksum in hex, which can then be converted to decimal to match {@code CRC32}'s
* {@code getValue()}. (The calculator on most operating systems can convert hex to decimal.)
*/
public class NuProcessBenchmark {

bturner (Collaborator Author):

I wish that there was a way to make this a bit more self-contained. I had thought about just having it directly fork head -c 250MB /dev/urandom, but that relies on the system's ability to generate sufficient entropy. Otherwise it gets extremely slow, which would skew the benchmark. (It also makes it impossible to perform any sort of quick validation on the output to test for corruption, since the checksum would continuously change)

@@ -43,7 +44,7 @@
<name>Rich DiCroce</name>
</developer>
<developer>
<name>Chris Cladden</name>
<name>Chris Cladden</name>

bturner (Collaborator Author):

Just fixed some whitespace here

*
* @param force if <code>true</code> is passed, STDIN will be immediately closed
* even if there are pending writes; otherwise, it will be closed
* after all pending writes are completed

bturner (Collaborator Author):

I noticed a Javadoc warning about this being missing

@bturner (Collaborator Author) commented Jan 12, 2018

Sorry for the delay, @brettwooldridge and @lfbayer . I got hung up on some details that I couldn't get quite how I wanted them.

My benchmark requires Java 8 because it uses CRC32.update(ByteBuffer), which is new in 8. Other approaches skew the benchmark because they require allocating byte[]s to copy everything out of the ByteBuffer.

MessageDigest has had update(ByteBuffer) since Java 5, but all of the digest algorithms are quite slow compared to a simple CRC32, and skew the benchmark as a result. I don't need a cryptographically strong hash; I'm just using the CRC32 a) as something to "do" with the output and b) to perform a low-overhead check that the output was pumped "correctly".
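
To illustrate the difference inside the handler (a snippet of the benchmark's handler above; the pre-Java-8 fallback shown in the comment is what would skew the numbers):

private final CRC32 crc32 = new CRC32();

@Override
public void onStdout(ByteBuffer buffer, boolean closed) {
    crc32.update(buffer); // Java 8+: consumes the buffer directly, no intermediate copy

    // Pre-Java 8 equivalent, which allocates a fresh byte[] on every callback:
    // byte[] copy = new byte[buffer.remaining()];
    // buffer.get(copy);
    // crc32.update(copy, 0, copy.length);
}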

@brettwooldridge (Owner):

It’s fine if the benchmark requires Java8.

@brettwooldridge merged commit 4989a9e into brettwooldridge:master Jan 13, 2018
@brettwooldridge (Owner) commented Jan 14, 2018

@bturner Checkout these memory results after my latest refactor:

Iteration   4: 21.312 ops/s
                 ·gc.alloc.rate:      0.656 MB/sec
                 ·gc.alloc.rate.norm: 52814.261 B/op
                 ·gc.count:           ≈ 0 counts
                 ·threads.alive:      11.000 threads
                 ·threads.daemon:     10.000 threads
                 ·threads.started:    11.000 threads
...
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate":
  0.610 ±(99.9%) 0.127 MB/sec [Average]
  (min, avg, max) = (0.411, 0.610, 0.834), stdev = 0.124
  CI (99.9%): [0.484, 0.737] (assumes normal distribution)

Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate.norm":
  54039.498 ±(99.9%) 2071.499 B/op [Average]
  (min, avg, max) = (52798.857, 54039.498, 59384.800), stdev = 2034.489
  CI (99.9%): [51967.999, 56110.998] (assumes normal distribution)

Compared to previous (on my Linux VM):

Iteration   4: 14.460 ops/s
                 ·gc.alloc.rate:      6.760 MB/sec
                 ·gc.alloc.rate.norm: 693246.857 B/op
                 ·gc.count:           ≈ 0 counts
                 ·threads.alive:      11.000 threads
                 ·threads.daemon:     10.000 threads
                 ·threads.started:    11.000 threads
...
Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate":
  7.175 ±(99.9%) 1.383 MB/sec [Average]
  (min, avg, max) = (3.776, 7.175, 9.336), stdev = 1.358
  CI (99.9%): [5.793, 8.558] (assumes normal distribution)

Secondary result "com.zaxxer.nuprocess.benchmark.NuProcessBenchmark.cat:·gc.alloc.rate.norm":
  694515.796 ±(99.9%) 2264.247 B/op [Average]
  (min, avg, max) = (693200.000, 694515.796, 699813.600), stdev = 2223.793
  CI (99.9%): [692251.549, 696780.043] (assumes normal distribution)

I profiled a jmh run with JProfiler (./benchmark.sh profile) and, as far as I can tell, 99% of the allocation was related to the benchmark itself. NuProcess itself, at least on Linux where these changes reside, should be running very close to zero-allocation, with the exception of process startup, where things like the environment are cloned and the 3 I/O buffers are created.

@bturner I'd be interested to hear how this affects your throughput numbers.

@brettwooldridge (Owner):

@bturner Also fixed up the benchmark to self-calculate the CRC32.

@brettwooldridge (Owner):

@bturner The moral of the story seems to be that JNA is extremely inefficient by default, but can be efficient if used carefully.

Structures are extremely inefficient, with every field access throwing off Iterator instances.

And (possibly the biggest impact): passing a ByteBuffer that was allocated by ByteBuffer.allocateDirect() across the JNA interface, compared to one from new Memory(xxx).getByteBuffer(), causes allocations of Memory$SharedMemory objects on every trip.
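
A small sketch of the contrast (buffer size illustrative); both produce direct buffers, but only the first reuses JNA's own native allocation:

Memory memory = new Memory(65536);
ByteBuffer jnaBacked = memory.getByteBuffer(0, memory.size()); // backed by JNA-owned memory
ByteBuffer jdkDirect = ByteBuffer.allocateDirect(65536);       // per the above, crossing JNA with this allocates Memory$SharedMemory on every trip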

@bturner (Collaborator Author) commented Jan 16, 2018

That's an epic improvement, @brettwooldridge! Sensational. My benchmark run now shows:

# Run complete. Total time: 00:03:51

Benchmark                                                 Mode  Cnt       Score        Error    Units
NuProcessBenchmark.cat                                   thrpt  100      27.591 ±      0.678    ops/s
NuProcessBenchmark.cat:·gc.alloc.rate                    thrpt  100       4.168 ±      0.098   MB/sec
NuProcessBenchmark.cat:·gc.alloc.rate.norm               thrpt  100  231161.994 ±    387.435     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space           thrpt  100       4.503 ±      3.528   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm      thrpt  100  254029.703 ± 198985.617     B/op
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space       thrpt  100       0.025 ±      0.048   MB/sec
NuProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm  thrpt  100    1382.581 ±   2662.888     B/op
NuProcessBenchmark.cat:·gc.count                         thrpt  100      16.000                counts
NuProcessBenchmark.cat:·gc.time                          thrpt  100     194.000                    ms
NuProcessBenchmark.cat:·threads.alive                    thrpt  100      23.510 ±      0.170  threads
NuProcessBenchmark.cat:·threads.daemon                   thrpt  100      22.510 ±      0.170  threads
NuProcessBenchmark.cat:·threads.started                  thrpt  100      24.000               threads

(Note that while the name is the same this isn't quite the same benchmark as the one in the repository. For consistency I need to keep using my existing benchmark, otherwise this result wouldn't really be a relevant comparison for me.)

The throughput shows a marginal improvement of about 2 ops/s, big enough to be statistically relevant, but the real win is definitely that new allocation rate.

As a potentially relevant aside, this is the result of a similar (identical within my ability to make it so) benchmark run against jnr-process:

# Run complete. Total time: 00:03:39

Benchmark                                                  Mode  Cnt       Score        Error    Units
JnrProcessBenchmark.cat                                   thrpt  100      28.234 ±      0.222    ops/s
JnrProcessBenchmark.cat:·gc.alloc.rate                    thrpt  100       4.388 ±      0.046   MB/sec
JnrProcessBenchmark.cat:·gc.alloc.rate.norm               thrpt  100  229441.708 ±    515.061     B/op
JnrProcessBenchmark.cat:·gc.churn.PS_Eden_Space           thrpt  100       4.554 ±      3.559   MB/sec
JnrProcessBenchmark.cat:·gc.churn.PS_Eden_Space.norm      thrpt  100  239207.232 ± 187075.828     B/op
JnrProcessBenchmark.cat:·gc.churn.PS_Survivor_Space       thrpt  100       0.072 ±      0.109   MB/sec
JnrProcessBenchmark.cat:·gc.churn.PS_Survivor_Space.norm  thrpt  100    3699.963 ±   5587.508     B/op
JnrProcessBenchmark.cat:·gc.count                         thrpt  100      16.000                counts
JnrProcessBenchmark.cat:·gc.time                          thrpt  100     103.000                    ms
JnrProcessBenchmark.cat:·threads.alive                    thrpt  100      16.980 ±      0.048  threads
JnrProcessBenchmark.cat:·threads.daemon                   thrpt  100      15.980 ±      0.048  threads
JnrProcessBenchmark.cat:·threads.started                  thrpt  100      17.000               threads

The allocation rates between the two are really close now, suggesting a transition over to JNR might not really have much "win" to it, from that perspective, now that NuProcess's JNA approach has been so aggressively tuned.

@bturner deleted the optimize-epoll-processing branch January 16, 2018 20:56
@brettwooldridge (Owner) commented Jan 17, 2018

@bturner Thanks for the feedback! Very encouraging.

As I think about the allocation rate, I thought about pooling. Currently, when a process is started and a new instance of BasePosixProcess (LinuxProcess) is created, I allocate the three buffers for stdin/stdout/stderr. These are 64K each. When the process ends, these become garbage to be collected. At a rate of 20+ processes per second, this results in a several MB/sec allocation rate.

NuProcess could likely benefit from a pooling scheme for these buffers, whereby when the process ends, the buffers are returned to a pool, to be picked up by the next process run.

A pool could be associated with each processing thread, which would have two benefits. First, because the pool would only be accessed by a single thread, it could be non-concurrent/lock-free. Second, processing threads have a finite lifetime, so if no processes are run for 2.5s (default) the thread terminates and the pool would go with it, allowing memory to be given back during idle periods.

During busy periods, this would result in a near zero allocation rate, except for a handful of minor objects; the buffers clearly being the bulk of the memory overhead of a process.
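
A rough sketch of that per-thread pool (hypothetical; nothing like this exists in NuProcess yet, and the names and sizes are illustrative):

import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Owned by a single processing thread, so no synchronization is needed; when the thread
// times out after its idle period, the pool becomes garbage along with it.
final class PumpThreadBufferPool {
    private static final int BUFFER_SIZE = 65536; // matches the current 64K stdin/stdout/stderr buffers
    private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

    ByteBuffer acquire() {
        ByteBuffer buffer = free.pollFirst();
        return buffer != null ? buffer : ByteBuffer.allocateDirect(BUFFER_SIZE);
    }

    void release(ByteBuffer buffer) {
        buffer.clear();
        free.addFirst(buffer); // LIFO keeps the most recently used buffer on top
    }
}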

@brettwooldridge (Owner):

I burned some midnight oil and implemented pooling of Memory/ByteBuffer objects. Much to my astonishment, the benchmark showed almost no difference at all in terms of allocation rate.

Either JNA or the JVM itself is playing a game behind the scenes to reuse the objects in the non-pooling case.

I’d like to measure one of the non-jmh tests with some of the heap analysis tools to eliminate jmh's own allocation from the picture.
