
Down the Rabbit Hole


This is where we give away the recipe to the secret sauce. When you come in with benchmarks like ours, there is a certain amount of skepticism that must be addressed.

We're in your bytecodez

In order to make HikariCP as fast as it is, we went down to bytecode-level engineering, and beyond. We pulled out every trick we know to help the JIT help you. We studied the bytecode output of the compiler, and even the assembly output of the JIT, to keep key routines under the JIT inline threshold. We flattened inheritance hierarchies, shadowed member variables, and eliminated casts.

Sometimes, seeing that a routine was surprisingly over the inline threshold, we would figure out how to squeeze out a few extra bytecodes. Take this simple example:

public SQLException checkException(SQLException sqle) {
    String sqlState = sqle.getSQLState();
    if (sqlState == null)
        return sqle;

    if (sqlState.startsWith("08"))
        _forceClose = true;
    else if (SQL_ERRORS.contains(sqlState))
        _forceClose = true;
    return sqle;
}

Simple enough method, checking if the SQLSTATE of an exception indicates a disconnection error. Here is the bytecode:

     0: aload_1
     1: invokevirtual #148                // Method java/sql/SQLException.getSQLState:()Ljava/lang/String;
     4: astore_2
     5: aload_2
     6: ifnonnull     11
     9: aload_1
    10: areturn
    11: aload_2
    12: ldc           #154                // String 08
    14: invokevirtual #156                // Method java/lang/String.startsWith:(Ljava/lang/String;)Z
    17: ifeq          28
    20: aload_0
    21: iconst_1
    22: putfield      #144                // Field _forceClose:Z
    25: goto          45
    28: getstatic     #41                 // Field SQL_ERRORS:Ljava/util/Set;
    31: aload_2
    32: invokeinterface #162,  2          // InterfaceMethod java/util/Set.contains:(Ljava/lang/Object;)Z
    37: ifeq          45
    40: aload_0
    41: iconst_1
    42: putfield      #144                // Field _forceClose:Z
    45: aload_1
    46: areturn

Smart rabbits know that the default inline threshold for a JVM running the server HotSpot compiler is 35 bytecodes. So we gave this routine some love. That early return is costing us, and maybe those conditionals could be combined. The second attempt was this:

String sqlState = sqle.getSQLState();
if (sqlState != null && (sqlState.startsWith("08") || SQL_ERRORS.contains(sqlState)))
    _forceClose = true;
return sqle;

Close, but no cigar: one bytecode over the threshold at 36 bytecodes. How about:

String sqlState = sqle.getSQLState();
_forceClose |= (sqlState != null && (sqlState.startsWith("08") || SQL_ERRORS.contains(sqlState)));
return sqle;

Looks simpler, right? It's actually worse: 45 bytecodes. The final solution:

String sqlState = sqle.getSQLState();
if (sqlState != null)
    _forceClose |= sqlState.startsWith("08") | SQL_ERRORS.contains(sqlState);
return sqle;

Note the binary OR (|) operator usage. A nice hack, sacrificing theoretical performance (a binary OR always evaluates both operands, so it is slower in theory) for concrete performance (the code gets inlined, which more than makes up for it). And the resulting bytecode:

     0: aload_1
     1: invokevirtual #146                // Method java/sql/SQLException.getSQLState:()Ljava/lang/String;
     4: astore_2
     5: aload_2
     6: ifnull        34
     9: aload_0
    10: dup
    11: getfield      #142                // Field _forceClose:Z
    14: aload_2
    15: ldc           #152                // String 08
    17: invokevirtual #154                // Method java/lang/String.startsWith:(Ljava/lang/String;)Z
    20: getstatic     #40                 // Field SQL_ERRORS:Ljava/util/Set;
    23: aload_2
    24: invokeinterface #160,  2          // InterfaceMethod java/util/Set.contains:(Ljava/lang/Object;)Z
    29: ior
    30: ior
    31: putfield      #142                // Field _forceClose:Z
    34: aload_1
    35: areturn

Right under the wire at 35 bytecodes. Small routine, and actually not particularly high-traffic, but you get the idea. Multiply that level of effort across the HikariCP library and you start to get an inkling of why it is fast.

Javassist-generated Delegates

Pretty much every connection pool, dare we say every pool available, has to "wrap" your real Connection, Statement, PreparedStatement, etc. instances and intercept methods like close() so that the Connection isn't actually closed but is instead returned to the pool. Statement and its subclasses must be wrapped, and SQLException caught and inspected to see if the exception reflects a disconnection that warrants ejecting the Connection from the pool.

What this means is "delegation". The Connection wrapper cares about intercepting close() or execute(sql), for example, but for almost all of the other methods of Connection it simply delegates (exception handling omitted). Something like:

public Clob createClob() throws SQLException {
   return delegate.createClob();
}
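
By contrast, one of the few methods we actually intercept looks, in rough sketch form, more like this (the pool field and releaseConnection() call here are illustrative, not HikariCP's exact API):

public void close() throws SQLException {
   if (!isClosed) {
      isClosed = true;
      parentPool.releaseConnection(this);   // return to the pool rather than closing the real connection
   }
}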

The first iteration of HikariCP also used plain delegation, and it still provides a "fallback" mode. An interface like PreparedStatement contains some 50+ methods, only 4 of which we are interested in intercepting. Rather than creating a wrapper class that has 50+ "delegate" methods like the one above, we use Javassist to generate all of the delegate methods. While this provides no inherent performance increase, it means that our "proxy" (wrapper) class need only contain the overridden methods. The Statement proxy class in HikariCP is only ~160 lines of code including comments, compared to 1100+ lines of code in other pools. This approach is in keeping with our minimalist ethos.
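
For the curious, generating those delegate methods with Javassist boils down to something like the following sketch (not HikariCP's actual generator; the class names and delegate field are illustrative, and exception handling is omitted):

// assumes javassist.ClassPool, CtClass, CtField, CtMethod, CtNewMethod imports
ClassPool pool = ClassPool.getDefault();
CtClass intf  = pool.get("java.sql.Connection");
CtClass proxy = pool.makeClass("com.example.ConnectionProxySketch");
proxy.addInterface(intf);
proxy.addField(CtField.make("java.sql.Connection delegate;", proxy));

for (CtMethod method : intf.getDeclaredMethods()) {
    if ("close".equals(method.getName()))
        continue;                                // intercepted methods are written by hand, not generated
    // $$ expands to the full argument list of the generated method
    String body = (method.getReturnType() == CtClass.voidType)
                ? "{ delegate." + method.getName() + "($$); }"
                : "{ return delegate." + method.getName() + "($$); }";
    proxy.addMethod(CtNewMethod.make(method.getReturnType(), method.getName(),
                                     method.getParameterTypes(), method.getExceptionTypes(),
                                     body, proxy));
}
Class<?> generated = proxy.toClass();            // load the generated proxy class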

Our delegates perform quite admirably:

Pool       Med (ms)   Avg (ms)   Max (ms)
BoneCP         5049       3249       6929
HikariCP         13         11         58

The fact that, even using delegates like everyone else, HikariCP achieves 13ms on this benchmark is more attributable to HikariCP's efficient core than anything else.

And yet, looking at the bytecode for all of those delegate methods, with their getfield, checkcast, and invokeinterface opcodes, it really touched a nerve. Is it possible to go faster?

Can we actually eliminate delegation itself?


Full-on Insanity

"I've always been mad, I know I've been mad,
like the most of us,
very hard to explain why you're mad,
even if you're not mad..." 
                       - Pink Floyd

But how? How could we eliminate delegation and still intercept the methods we need? Even more, we need to wrap every "delegate" method with a try..catch to interrogate SQLExceptions, which is itself interception, isn't it?

In order to eliminate delegation the user needs to run against the "bare metal" of their driver classes, yet we still need to intercept methods and wrap them with exception handlers. We were already using Javassist to generate our classes for delegation. Why not use Javassist to inject our code directly into the driver's classes?

Play an Instrument

However, the classes must be altered before they are loaded ... because convincing the JVM to reload classes is no trivial task, particularly when you don't own the ClassLoader in question. The answer lay in java.lang.instrument. We built an instrumentation "agent" that "instruments" the driver classes on the fly as they are loaded, injecting our code into them, including try..catch blocks where necessary. The instrumentation agent is dynamically loaded and unloaded so that it doesn't spend time inspecting classes that have nothing to do with JDBC and no need for instrumentation.
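
In sketch form, such an agent boils down to a ClassFileTransformer registered from agentmain() (a minimal illustration, not HikariCP's actual agent; the driver class name below is a placeholder):

// assumes java.lang.instrument.* and java.security.ProtectionDomain imports
public class InstrumentationAgentSketch {
    public static void agentmain(String args, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            public byte[] transform(ClassLoader loader, String className, Class<?> redefined,
                                    ProtectionDomain domain, byte[] classfileBuffer) {
                if (!"org/example/jdbc/DriverConnection".equals(className)) {
                    return null;                 // not a JDBC class we care about, leave it untouched
                }
                // the real agent hands these bytes to Javassist here, injecting fields,
                // methods, and try..catch blocks before the JVM defines the class
                return classfileBuffer;
            }
        });
        // once the pool has loaded its driver classes, the transformer is removed
        // (inst.removeTransformer(...)) so no further class loading pays the cost
    }
}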

Pure Gold

As slim as our "delegate" proxies are, there is still a fair amount of code, especially in the ConnectionProxy class. The prospect of "inlining" the bytecode, or worse, source code, into the instrumentation code had a bad smell about it. We've already written the intercept code once in our proxies; can't we just use that somehow? But the code is in our classes, not in the target driver's classes.

This is where we think code can sometimes become art. We created an annotation, @HikariInject, and with it we annotate all of the fields and methods that we want injected from the existing proxy classes. The instrumentation agent inspects our proxy classes and injects the fields and methods tagged with @HikariInject into the target driver class -- with some special logic for handling collisions. The pure gold is that the exact same class code used in "delegation" mode is what gets injected in "instrumentation" mode. There is only one canonical source for both.
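
In sketch form, such a marker and its use could look something like this (the retention, targets, and proxy snippet below are our illustration, not necessarily HikariCP's exact declarations):

// assumes java.lang.annotation.* imports
@Retention(RetentionPolicy.CLASS)                // visible in the class file, so the agent can read it
@Target({ElementType.FIELD, ElementType.METHOD})
public @interface HikariInject {
}

// usage inside a proxy class: anything tagged gets copied into the driver class
@HikariInject private boolean _forceClose;

@HikariInject
public SQLException checkException(SQLException sqle) { /* as shown earlier */ }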

The instrumenter is extremely robust, but if there is any kind of failure injecting the code, HikariCP drops back to delegation mode (and logs a message to that effect). The JVM is smart enough to know that if an instrumentation agent throws an exception, the class is loaded cleanly without it -- nothing can be corrupted. Injection takes place at pool startup time, and typically takes only about 200ms.

The result of this is:

Pool       Med (ms)   Avg (ms)   Max (ms)
BoneCP         5049       3249       6929
HikariCP          8          7         13

While going from 13ms (delegates) to 8ms (instrumentation) may not seem like much, it represents roughly a 40% improvement.


Yeah, but still

Still, even without instrumentation, how do we get anywhere near 13ms for 60+ million JDBC API invocations? Well, we're obviously running against a stub implementation, so the JIT is doing a lot of inlining. However, the same inlining at the stub level is occurring for BoneCP in the benchmark, so there is no inherent advantage to us.

But inlining is part of the equation, and I will say that BoneCP has at least 10 methods that are flagged as "hot" by the JVM but that the JIT considers too large to inline -- and at least two of these are on the critical path. HikariCP has none. Additionally, some of the features in BoneCP require it to do much more work (I thought "bone" stood for bare-bones, but maybe I'm mistaken). Which brings us to another topic...

Scheduler quanta

Some light reading. TL;DR: obviously, when you're running 400 threads "at once", you aren't really running them "at once" unless you have 400 cores. The operating system, using N cores, switches between your threads, giving each a small "slice" of time to run called a quantum (plural: quanta).

But with 400 threads, when your time runs out (as a thread) it may be a "long time" before the scheduler gives you a chance to run again. With this many threads, if a thread cannot complete what it needs to get done during its time-slice, well, there is a performance penalty to be paid. And not a small one.

We have combed through HikariCP, crushing and optimizing the critical code paths so that each of those 60+ million JDBC invocations can fully execute within a single "quantum" -- with the exception, of course, of a truly blocked condition such as no available connections.

Which brings us to...

CPU Cache-line Invalidation

Another big hit incurred when you can't get your work done within a quantum is CPU cache-line invalidation. If your thread is preempted by the scheduler, when it does get a chance to run again, all of the data it was frequently accessing is likely no longer in the core's L1 or core-pair L2 cache. This is even more likely because you have no control over which core you will be scheduled on next.


Can we go faster?

Almost certainly. Our original goal when moving from delegates to instrumentation was to reach sub-millisecond times for 60+ million JDBC API invocations, and that goal still remains. We have some ideas that really get into the esoterica of modern CPU architectures, such as "false sharing".
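
To give a flavor of the false-sharing idea (a toy illustration only, not HikariCP code): two fields written by different threads can land on the same 64-byte cache line, so every write by one thread invalidates that line for the other core; padding the fields apart is the classic, if JVM-layout-dependent, remedy.

class Counters {
    volatile long counterA;                      // written only by thread A
    long p1, p2, p3, p4, p5, p6, p7;             // padding pushes counterB onto a different cache line
    volatile long counterB;                      // written only by thread B
}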