
Down the Rabbit Hole


This is where we give away the recipe to the secret sauce. When you come in with benchmarks like ours, a certain amount of skepticism must be addressed. If you think of performance, and of connection pools, you might be tempted into thinking that the pool itself is the most important part of the performance equation. Not necessarily so. The number of getConnection() operations is small in comparison to other JDBC operations, so a large share of the performance gains come from optimizing the "delegates" that wrap Connection, Statement, ResultSet, and so on.

We're in your bytecodez

In order to make HikariCP as fast as it is, we went down to bytecode-level engineering, and beyond. We pulled out every trick we know to help the JIT help you. We studied the bytecode output of the compiler, and even the assembly output of the JIT to limit key routines to less than the JIT inline-threshold. We flattened inheritance hierarchies, shadowed member variables, eliminated casts.

Sometimes, on seeing that a routine was surprisingly over the inline threshold, we would figure out how to squeeze a few extra bytecodes out of it. Take this simple example:

    public SQLException checkException(SQLException sqle) {
        String sqlState = sqle.getSQLState();
        if (sqlState == null)
            return sqle;

        if (sqlState.startsWith("08"))
            _forceClose = true;
        else if (SQL_ERRORS.contains(sqlState))
            _forceClose = true;
        return sqle;
    }

Simple enough method, checking if the SQLSTATE of an exception indicates a disconnection error. Here is the bytecode:

         0: aload_1
         1: invokevirtual #148                // Method java/sql/SQLException.getSQLState:()Ljava/lang/String;
         4: astore_2
         5: aload_2
         6: ifnonnull     11
         9: aload_1
        10: areturn
        11: aload_2
        12: ldc           #154                // String 08
        14: invokevirtual #156                // Method java/lang/String.startsWith:(Ljava/lang/String;)Z
        17: ifeq          28
        20: aload_0
        21: iconst_1
        22: putfield      #144                // Field _forceClose:Z
        25: goto          45
        28: getstatic     #41                 // Field SQL_ERRORS:Ljava/util/Set;
        31: aload_2
        32: invokeinterface #162,  2          // InterfaceMethod java/util/Set.contains:(Ljava/lang/Object;)Z
        37: ifeq          45
        40: aload_0
        41: iconst_1
        42: putfield      #144                // Field _forceClose:Z
        45: aload_1
        46: return
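
For reference, bytecode listings like this come straight from the JDK's javap disassembler; a quick sketch of the command (the class name here is illustrative):

    javap -c -p com.zaxxer.hikari.proxy.ConnectionProxy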

Smart rabbits know that the default inline threshold for a JVM running the server HotSpot compiler is 35 bytecodes (controlled by -XX:MaxInlineSize). So we gave this routine some love. That early return is costing us, and maybe those conditionals could be combined. Our second attempt was this:

    String sqlState = sqle.getSQLState();
    if (sqlState != null && (sqlState.startsWith("08") || SQL_ERRORS.contains(sqlState)))
        _forceClose = true;
    return sqle;

Close, but no cigar: at 36 bytecodes it is one byte over the threshold. How about this:

    String sqlState = sqle.getSQLState();
    _forceClose |= (sqlState != null && (sqlState.startsWith("08") || SQL_ERRORS.contains(sqlState)));
    return sqle;

Looks simpler, right? It's actually worse: 45 bytecodes. The final solution:

    String sqlState = sqle.getSQLState();
    if (sqlState != null)
        _forceClose |= sqlState.startsWith("08") | SQL_ERRORS.contains(sqlState);
    return sqle;

Note the use of the non-short-circuiting OR operator (|). A nice hack: it sacrifices theoretical performance (both operands are always evaluated, because | does not short-circuit) for concrete performance (the code is inlined, which more than makes up for it). And the resulting bytecode:

         0: aload_1
         1: invokevirtual #153                // Method java/sql/SQLException.getSQLState:()Ljava/lang/String;
         4: astore_2
         5: aload_2
         6: ifnull        34
         9: aload_0
        10: dup
        11: getfield      #149                // Field _forceClose:Z
        14: aload_2
        15: ldc           #157                // String 08
        17: invokevirtual #159                // Method java/lang/String.startsWith:(Ljava/lang/String;)Z
        20: getstatic     #37                 // Field SQL_ERRORS:Ljava/util/Set;
        23: aload_2
        24: invokeinterface #165,  2          // InterfaceMethod java/util/Set.contains:(Ljava/lang/Object;)Z
        29: ior
        30: ior
        31: putfield      #149                // Field _forceClose:Z
        34: return

Right under the wire at 35 bytes of bytecode (the final return sits at offset 34, zero-based). A small routine, and actually not a particularly high-traffic one, but you get the idea. Multiply that level of effort across the HikariCP library and you start to get an inkling of why it is fast.

Javassist-generated Delegates

Pretty much every connection pool, dare we say every pool available, has to "wrap" your real Connection, Statement, PreparedStatement, etc. instances and intercept methods like close() so that the Connection isn't actually closed but is instead returned to the pool. Statement and its subclasses must be wrapped, and SQLException caught and inspected to see whether the exception reflects a disconnection that warrants ejecting the Connection from the pool.
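
For example, the Connection wrapper's close() never closes the physical connection; a minimal sketch of the interception, using illustrative names (closeOpenStatements() and parentPool.releaseConnection() are not HikariCP's actual API):

    public void close() throws SQLException {
        // Close any Statements still open on this wrapper, then hand the
        // physical connection back to the pool rather than closing it.
        closeOpenStatements();                 // hypothetical helper
        parentPool.releaseConnection(this);    // hypothetical pool call
    }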

What this means is "delegation". The Connection wrapper cares about intercepting close() or execute(sql), for example, but for almost all of the other methods of Connection it simply delegates. Something like this (exception handling omitted):

    public Clob createClob() {
        return delegate.createClob();
    }

An interface like PreparedStatement contains some 50+ methods, only 4 of which we are interested in intercepting. Rather than creating a wrapper class with 50+ hand-written "delegate" methods like the one above, we use Javassist to generate all of the delegate methods. While this provides no inherent performance increase, it means that our "proxy" (wrapper) class need only contain the overridden methods. The Statement proxy class in HikariCP is only ~100 lines of code excluding comments, compared to 1100+ lines of code in other pools. This approach is in keeping with our minimalist ethos.
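
To give a flavor of the technique, here is a heavily simplified sketch of delegate generation with Javassist. The class and field names are illustrative rather than HikariCP's actual generator; note that in Javassist source text, $$ expands to the method's full argument list and ($r) casts to its return type:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import javassist.ClassPool;
    import javassist.CtClass;
    import javassist.CtField;
    import javassist.CtMethod;
    import javassist.CtNewMethod;

    public class DelegateGenerator {
        public static Class<?> generateStatementProxy() throws Exception {
            ClassPool pool = ClassPool.getDefault();
            CtClass intf = pool.get("java.sql.Statement");
            CtClass proxy = pool.makeClass("com.example.StatementJavassistProxy");
            proxy.addInterface(intf);
            proxy.addField(CtField.make("public java.sql.Statement delegate;", proxy));

            // The handful of methods the proxy intercepts; these would be
            // written by hand, everything else gets a generated one-line
            // delegate body.
            Set<String> intercepted = new HashSet<>(
                Arrays.asList("close", "execute", "executeQuery", "executeUpdate"));

            for (CtMethod method : intf.getMethods()) {
                if (intercepted.contains(method.getName())
                        || method.getDeclaringClass().getName().equals("java.lang.Object")) {
                    continue;
                }
                // Copy the signature from the interface, then give it a body
                // that simply forwards to the wrapped instance.
                CtMethod delegated = CtNewMethod.copy(method, proxy, null);
                delegated.setBody("{ return ($r) delegate." + method.getName() + "($$); }");
                proxy.addMethod(delegated);
            }
            return proxy.toClass();
        }
    }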

Our delegates perform quite admirably:

    Pool       Med (ms)   Avg (ms)   Max (ms)
    BoneCP     4635       3060       6747
    HikariCP   11         9          28

The fact that HikariCP achieves 11ms times on this benchmark, even though it uses delegates like everyone else, is attributable more to Hikari's efficient core than to anything else.

Micro-optimizations

HikariCP contains many micro-optimizations that are individually barely measurable, but that combine to boost overall performance. Some of these optimizations are measured in fractions of a millisecond amortized over millions of invocations.

ArrayList

One non-trivial (performance-wise) optimization was eliminating the use of an ArrayList<Statement> instance in the ConnectionProxy used to track open Statement instances. When a Statement is closed, it must be removed from this collection; when the Connection is closed, it must iterate the collection, close any open Statement instances, and finally clear the collection. The Java ArrayList, wisely for general-purpose use, performs a range check on every get(int index) call. Because we can provide guarantees about our ranges, however, this check is pure overhead. Additionally, the remove(Object) implementation scans from head to tail, but a common pattern in JDBC programming is to close Statements either immediately after use or in reverse order of opening. For these cases, a scan that starts at the tail will perform better. Therefore, ArrayList<Statement> was replaced with a custom class, FastStatementList, which eliminates range checking and performs removal scans from tail to head, as sketched below.
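
A minimal sketch of the idea, assuming (as the pool can guarantee) that callers never pass an out-of-range index; this is illustrative, not HikariCP's actual class:

    import java.sql.Statement;
    import java.util.Arrays;

    public final class FastStatementList {
        private Statement[] elements = new Statement[32];
        private int size;

        public void add(Statement statement) {
            if (size == elements.length)
                elements = Arrays.copyOf(elements, size * 2);
            elements[size++] = statement;
        }

        // No range check: the pool guarantees 0 <= index < size.
        public Statement get(int index) {
            return elements[index];
        }

        // Statements are typically closed immediately after use or in reverse
        // order of opening, so scan from the tail rather than the head.
        public boolean remove(Statement statement) {
            for (int i = size - 1; i >= 0; i--) {
                if (elements[i] == statement) {
                    System.arraycopy(elements, i + 1, elements, i, size - i - 1);
                    elements[--size] = null;
                    return true;
                }
            }
            return false;
        }

        public int size() {
            return size;
        }
    }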

Invocation: invokevirtual vs invokestatic

In order to generate proxies for Connection, Statement, and ResultSet instances, HikariCP initially used a singleton factory, held (in the case of ConnectionProxy) in a static field, PROXY_FACTORY.

There were a dozen or so methods resembling the following:

    private final PreparedStatement __prepareStatement(String sql, String[] columnNames) throws SQLException
    {
        return PROXY_FACTORY.getProxyPreparedStatement(this, delegate.prepareStatement(sql, columnNames));
    }

Using the original singleton factory, the generated bytecode looked like this:

    private final java.sql.PreparedStatement __prepareStatement(java.lang.String, java.lang.String[]) throws java.sql.SQLException;
    flags: ACC_PRIVATE, ACC_FINAL
    Code:
      stack=5, locals=3, args_size=3
         0: getstatic     #59                 // Field PROXY_FACTORY:Lcom/zaxxer/hikari/proxy/ProxyFactory;
         3: aload_0
         4: aload_0
         5: getfield      #3                  // Field delegate:Ljava/sql/Connection;
         8: aload_1
         9: aload_2
        10: invokeinterface #74,  3           // InterfaceMethod java/sql/Connection.prepareStatement:(Ljava/lang/String;[Ljava/lang/String;)Ljava/sql/PreparedStatement;
        15: invokevirtual #69                 // Method com/zaxxer/hikari/proxy/ProxyFactory.getProxyPreparedStatement:(Lcom/zaxxer/hikari/proxy/ConnectionProxy;Ljava/sql/PreparedStatement;)Ljava/sql/PreparedStatement;
        18: areturn

You can see that, first, there is a getstatic call to fetch the value of the static field PROXY_FACTORY, and (lastly) an invokevirtual call to getProxyPreparedStatement() on the ProxyFactory instance.

We eliminated the singleton factory (which was generated by Javassist) and replaced it with a final class having static methods (whose bodies are injected by Javassist). The Java code became:

    private final PreparedStatement __prepareStatement(String sql, String[] columnNames) throws SQLException
    {
        return ProxyFactory.getProxyPreparedStatement(this, delegate.prepareStatement(sql, columnNames));
    }

Where getProxyPreparedStatement() is a static method defined in the ProxyFactory class. The resulting bytecode is:

    private final java.sql.PreparedStatement __prepareStatement(java.lang.String, java.lang.String[]) throws java.sql.SQLException;
    flags: ACC_PRIVATE, ACC_FINAL
    Code:
      stack=4, locals=3, args_size=3
         0: aload_0
         1: aload_0
         2: getfield      #3                  // Field delegate:Ljava/sql/Connection;
         5: aload_1
         6: aload_2
         7: invokeinterface #72,  3           // InterfaceMethod java/sql/Connection.prepareStatement:(Ljava/lang/String;[Ljava/lang/String;)Ljava/sql/PreparedStatement;
        12: invokestatic  #67                 // Method com/zaxxer/hikari/proxy/ProxyFactory.getProxyPreparedStatement:(Lcom/zaxxer/hikari/proxy/ConnectionProxy;Ljava/sql/PreparedStatement;)Ljava/sql/PreparedStatement;
        15: areturn

There are three things of note here. First, the getstatic call is gone. Second, the invokevirtual call is replaced with an invokestatic call that is more easily optimized by the JVM. Lastly, and possibly not noticed at first glance, the stack size is reduced from 5 elements to 4. This is because with invokevirtual the instance of ProxyFactory is implicitly passed on the stack, and there is an additional (unseen) pop of that value from the stack when getProxyPreparedStatement() is called. In all, this change removed a static field access, a push and a pop from the stack, and made the invocation easier for the JIT to optimize because the call site is guaranteed not to change.


Yeah, but still

Still, how do we get anywhere near 13ms for 60+ million JDBC API invocations? We are obviously running against a stub JDBC implementation, so the JIT is doing a lot of inlining. However, the same inlining at the stub level occurs for BoneCP in the benchmark, so it confers no inherent advantage on us.
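
Inlining decisions like these are easy to observe with HotSpot's standard diagnostic flags; a sketch, where MixedBench stands in for the benchmark's main class:

    java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MixedBench

Methods the JIT declines to inline show up with messages such as "too large" or "hot method too big".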

But inlining is only part of the equation, and we will note that BoneCP has at least 10 methods that the JVM flags as "hot" but that the JIT considers too large to inline, at least two of them on the critical path. HikariCP has none. Additionally, some of the features in BoneCP require it to do much more work. Which brings us to another topic...

Scheduler quanta

Some light reading. TL;DR: obviously, when you're running 400 threads "at once", you aren't really running them at once unless you have 400 cores. The operating system, using N cores, switches between your threads, giving each a small "slice" of time to run called a quantum.

But with 400 threads, when your time-slice runs out (as a thread), it may be a "long time" before the scheduler gives you a chance to run again. With this many threads, if a thread cannot complete what it needs to get done within its time-slice, there is a performance penalty to be paid, and not a small one.

In the MixedBench benchmark, because we have combed through HikariCP, crushing and optimizing the critical code paths, HikariCP can fully execute 60+ million JDBC invocations against the stub-JDBC classes within a single "quantum".

Which brings us to...

CPU Cache-line Invalidation

Another big hit incurred when you can't get your work done within a single quantum is CPU cache-line invalidation. If your thread is preempted by the scheduler, then when it finally gets a chance to run again, all of the data it was frequently accessing is likely no longer in that core's L1 or core-pair L2 cache. This is all the more likely because you have no control over which core you will be scheduled on next.


Can we go faster?

Probably. Our original goal was to reach low single-digit millisecond times, or even sub-millisecond times, for 60+ million JDBC API invocations, and that goal still stands.