Down the Rabbit Hole

Brett Wooldridge edited this page Sep 23, 2016 · 42 revisions

This is where we give away the recipe to the secret sauce. When you come in with benchmarks like ours, there is a certain amount of skepticism that must be addressed. If you think about performance and connection pools, you might be tempted to think that the pool itself is the most important part of the performance equation. That is not clearly the case. The number of getConnection() operations is small compared to the number of other JDBC operations. A large share of the performance gains comes from optimizing the "delegates" that wrap Connection, Statement, etc.

We're in your bytecodez

In order to make HikariCP as fast as it is, we went down to bytecode-level engineering, and beyond. We pulled out every trick we know to help the JIT help you. We studied the bytecode output of the compiler, and even the assembly output of the JIT to limit key routines to less than the JIT inline-threshold. We flattened inheritance hierarchies, shadowed member variables, eliminated casts.

Micro-optimizations

HikariCP contains many micro-optimizations that individually are barely measurable, but together combine as a boost to overall performance. Some of these optimizations are measured in fractions of a millisecond amortized over millions of invocations.

ArrayList

One non-trivial (performance-wise) optimization was eliminating the ArrayList<Statement> instance that ConnectionProxy used to track open Statement instances. When a Statement is closed, it must be removed from this collection. When the Connection is closed, it must iterate the collection, close any open Statement instances, and finally clear the collection. The Java ArrayList, wisely for general-purpose use, performs a range check on every get(int index) call. However, because we can provide guarantees about our ranges, this check is merely overhead. Additionally, the remove(Object) implementation scans from head to tail, whereas a common pattern in JDBC programming is to close Statements either immediately after use or in reverse order of opening. For these cases, a scan that starts at the tail performs better. Therefore, ArrayList<Statement> was replaced with a custom class, FastList, which eliminates range checking and performs removal scans from tail to head.
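
The two ideas above (no explicit range check, removal scan from tail to head) can be sketched as follows. This is a minimal illustration under stated assumptions, not HikariCP's actual FastList source: the class name FastListSketch, its method set, and the choice to compare by identity are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical, simplified sketch of the idea; NOT the real FastList source.
final class FastListSketch<T> {
    private T[] elements;
    private int size;

    @SuppressWarnings("unchecked")
    FastListSketch(int capacity) {
        this.elements = (T[]) new Object[capacity];
    }

    // Append, doubling the backing array when it is full.
    void add(T element) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, Math.max(1, size * 2));
        }
        elements[size++] = element;
    }

    // No explicit range check against size: the caller guarantees
    // 0 <= index < size. (The JVM's own array bounds check still applies.)
    T get(int index) {
        return elements[index];
    }

    // Scan from tail to head, so the most recently added element is found
    // first. Identity comparison is an assumption made for this sketch.
    boolean remove(T element) {
        for (int i = size - 1; i >= 0; i--) {
            if (elements[i] == element) {
                int moved = size - i - 1;
                if (moved > 0) {
                    System.arraycopy(elements, i + 1, elements, i, moved);
                }
                elements[--size] = null; // drop the stale reference
                return true;
            }
        }
        return false;
    }

    int size() {
        return size;
    }

    // Null out the used slots and reset the size.
    void clear() {
        Arrays.fill(elements, 0, size, null);
        size = 0;
    }
}
```

With this shape, closing the most recently opened Statement hits on the first comparison of the scan, whereas a head-to-tail scan would visit every earlier element first.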

ConcurrentBag

HikariCP contains a custom lock-free collection called a ConcurrentBag. The idea was borrowed from the C# .NET ConcurrentBag class. The ConcurrentBag provides ThreadLocal caching as well as queue-stealing in a lock-free design providing a high degree of concurrency and minimized occurrences of false-sharing.
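
To make the thread-local caching idea concrete, here is a rough sketch under stated assumptions. It is not the real ConcurrentBag: the names BagSketch, Entry, borrow, and giveBack are hypothetical, there is no waiting or timeout when nothing is free, and the shared list is a CopyOnWriteArrayList chosen for brevity (its writes take an internal lock, so only the borrow and return paths shown here are lock-free).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical, simplified sketch; NOT the real ConcurrentBag.
final class BagSketch<T> {
    static final class Entry<T> {
        final T value;
        final AtomicBoolean inUse = new AtomicBoolean(false);

        Entry(T value) {
            this.value = value;
        }
    }

    // Every entry in the bag, visible to all threads.
    private final CopyOnWriteArrayList<Entry<T>> shared = new CopyOnWriteArrayList<>();

    // Entries this thread returned recently; tried first when borrowing.
    private final ThreadLocal<List<Entry<T>>> recent = ThreadLocal.withInitial(ArrayList::new);

    void add(T value) {
        shared.add(new Entry<>(value));
    }

    // Returns a claimed entry, or null if none is free.
    // (A real pool would wait with a timeout instead of returning null.)
    Entry<T> borrow() {
        List<Entry<T>> local = recent.get();
        for (int i = local.size() - 1; i >= 0; i--) {
            Entry<T> e = local.remove(i);
            if (e.inUse.compareAndSet(false, true)) {
                return e; // reused from this thread's own cache
            }
        }
        for (Entry<T> e : shared) {
            if (e.inUse.compareAndSet(false, true)) {
                return e; // "stolen" from the shared list
            }
        }
        return null;
    }

    // Mark the entry free and remember it in this thread's cache.
    void giveBack(Entry<T> e) {
        e.inUse.set(false);
        recent.get().add(e);
    }
}
```

Ownership is decided by a single compareAndSet on the entry's flag, so a thread that loses the race simply moves on to the next entry rather than blocking. A thread that repeatedly borrows and returns tends to get its own entry back from its thread-local list without touching the shared list at all.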

Invocation: invokevirtual vs invokestatic

In order to generate proxies for Connection, Statement, and ResultSet instances, HikariCP initially used a singleton factory, held (in the case of ConnectionProxy) in a static field named PROXY_FACTORY.

There were a dozen or so methods resembling the following:

    public final PreparedStatement prepareStatement(String sql, String[] columnNames) throws SQLException
    {
        return PROXY_FACTORY.getProxyPreparedStatement(this, delegate.prepareStatement(sql, columnNames));
    }

Using the original singleton factory, the generated bytecode looked like this:

    public final java.sql.PreparedStatement prepareStatement(java.lang.String, java.lang.String[]) throws java.sql.SQLException;
    flags: ACC_PUBLIC, ACC_FINAL
    Code:
      stack=5, locals=3, args_size=3
         0: getstatic     #59                 // Field PROXY_FACTORY:Lcom/zaxxer/hikari/proxy/ProxyFactory;
         3: aload_0
         4: aload_0
         5: getfield      #3                  // Field delegate:Ljava/sql/Connection;
         8: aload_1
         9: aload_2
        10: invokeinterface #74,  3           // InterfaceMethod java/sql/Connection.prepareStatement:(Ljava/lang/String;[Ljava/lang/String;)Ljava/sql/PreparedStatement;
        15: invokevirtual #69                 // Method com/zaxxer/hikari/proxy/ProxyFactory.getProxyPreparedStatement:(Lcom/zaxxer/hikari/proxy/ConnectionProxy;Ljava/sql/PreparedStatement;)Ljava/sql/PreparedStatement;
        18: areturn

You can see that first there is a getstatic instruction to fetch the value of the static field PROXY_FACTORY, and lastly an invokevirtual call to getProxyPreparedStatement() on the ProxyFactory instance.

We eliminated the singleton factory (which was generated by Javassist) and replaced it with a final class having static methods (whose bodies are injected by Javassist). The Java code became:

    public final PreparedStatement prepareStatement(String sql, String[] columnNames) throws SQLException
    {
        return ProxyFactory.getProxyPreparedStatement(this, delegate.prepareStatement(sql, columnNames));
    }

Here, getProxyPreparedStatement() is a static method defined in the ProxyFactory class. The resulting bytecode is:

    public final java.sql.PreparedStatement prepareStatement(java.lang.String, java.lang.String[]) throws java.sql.SQLException;
    flags: ACC_PUBLIC, ACC_FINAL
    Code:
      stack=4, locals=3, args_size=3
         0: aload_0
         1: aload_0
         2: getfield      #3                  // Field delegate:Ljava/sql/Connection;
         5: aload_1
         6: aload_2
         7: invokeinterface #72,  3           // InterfaceMethod java/sql/Connection.prepareStatement:(Ljava/lang/String;[Ljava/lang/String;)Ljava/sql/PreparedStatement;
        12: invokestatic  #67                 // Method com/zaxxer/hikari/proxy/ProxyFactory.getProxyPreparedStatement:(Lcom/zaxxer/hikari/proxy/ConnectionProxy;Ljava/sql/PreparedStatement;)Ljava/sql/PreparedStatement;
        15: areturn

There are three things of note here. The getstatic instruction is gone. The invokevirtual call is replaced with an invokestatic call, which is more easily optimized by the JVM. Lastly, and possibly not noticed at first glance, the stack size is reduced from 5 elements to 4 elements. This is because invokevirtual implicitly passes the ProxyFactory instance on the stack, and there is an additional (unseen) pop of that value from the stack when getProxyPreparedStatement() is called. In all, this change removed a static field access and a push and pop from the stack, and made the invocation easier for the JIT to optimize because the call site is guaranteed not to change.
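
To reproduce the difference outside of HikariCP, the following self-contained sketch can be compiled and inspected with javap -c CallSites. All class and method names here (InstanceFactory, StaticFactory, CallSites, wrap) are hypothetical and chosen only for illustration.

```java
// Hypothetical names for illustration; these are not HikariCP classes.
final class InstanceFactory {
    // A singleton held in a static field, like PROXY_FACTORY above.
    static final InstanceFactory INSTANCE = new InstanceFactory();

    String wrap(String s) {
        return "[" + s + "]";
    }
}

final class StaticFactory {
    private StaticFactory() {
    }

    static String wrap(String s) {
        return "[" + s + "]";
    }
}

final class CallSites {
    // Compiles to: getstatic INSTANCE, aload_0, invokevirtual wrap, areturn
    static String viaSingleton(String s) {
        return InstanceFactory.INSTANCE.wrap(s);
    }

    // Compiles to: aload_0, invokestatic wrap, areturn
    static String viaStatic(String s) {
        return StaticFactory.wrap(s);
    }
}
```

Both methods return the same value. Only the call sequence differs: the singleton version needs the extra getstatic to push the receiver before the invokevirtual.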


Yeah, but still

In our benchmark we are obviously running against a stub driver implementation, so the JIT is doing a lot of inlining. However, the same inlining at the stub-level is occurring for other pools in the benchmark. So, no inherent advantage to us.

But inlining is certainly a big part of the equation even when real drivers are in use, which brings us to another topic...

Scheduler quanta

Some light reading. TL;DR Obviously, when you're running 400 threads "at once", you aren't really running them "at once" unless you have 400 cores. The operating system, using N cores, switches between your threads giving each a small "slice" of time to run called a quantum.

But with 400 threads, when your time runs out (as a thread) it may be a "long time" before the scheduler gives you a chance to run again. With this many threads, it is crucial that a thread get as much as possible done during its time-slice, and avoid locks that force it to give up that time-slice; otherwise there is a performance penalty to be paid. And not a small one.

Which brings us to...

CPU Cache-line Invalidation

Another big hit incurred when you can't get your work done within a single quantum is CPU cache-line invalidation. If your thread is preempted by the scheduler, then by the time it gets a chance to run again, all of the data it was frequently accessing is likely no longer in the core's L1 or core-pair L2 cache. This is even more likely because you have no control over which core you will be scheduled on next.