Down the Rabbit Hole
This is where we give away the recipe to the secret sauce. If we were smart, we'd leave it to those industrious enough to read the code, lest others copy what we have done too easily. But honestly, when you come in with benchmarks like these, there is a certain amount of skepticism that must be addressed.
In order to make HikariCP as fast as it is, we went down to bytecode-level engineering. We pulled out every
trick we know to help the JIT help you. We tried to limit key routines to less than the JIT inline-threshold,
we chased down and eliminated as many invokeinterface or invokespecial bytecode operations as
possible, flattened inheritance hierarchies, shadowed member variables, eliminated casts. Trusting no one, we
dissected the JVM classes and replaced where necessary. We studied operating system thread schedulers and JIT
compiler output. We think it shows in our results. But like rust, we never sleep; there are still a few
unexplored paths we intend to go down in the future.
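
To make the inline-threshold point concrete, here is a minimal sketch of one common way to keep a hot method small: the rare branch is moved into its own method so its bytecode does not count against the hot one. The class and method names are hypothetical and not taken from HikariCP. On typical HotSpot builds the relevant limits are `-XX:MaxInlineSize` (35 bytes of bytecode by default) and `-XX:FreqInlineSize`, and running with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining` prints the JIT's inlining decisions.

```java
// Hypothetical example of keeping a hot method under the inline threshold.
class InlineFriendly {
    private final int[] slots = new int[16];
    private int size;

    // Hot path: one comparison, one store, one increment.
    void add(int value) {
        if (size == slots.length) {
            throw overflow();   // rare path lives in a separate method
        }
        slots[size++] = value;
    }

    // Cold path: the string building here would otherwise bloat add().
    private IllegalStateException overflow() {
        return new IllegalStateException("capacity " + slots.length + " exceeded");
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        InlineFriendly buffer = new InlineFriendly();
        for (int i = 0; i < 16; i++) {
            buffer.add(i);
        }
        System.out.println(buffer.size());
    }
}
```

Whether a given method is actually inlined depends on the JVM version and flags, so the diagnostic output is the thing to check rather than the source code.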
There is not only beauty in simplicity, but if done right, speed.
Pretty much every connection pool, dare we say every pool available, has to "wrap" your real Connection,
Statement, PreparedStatement, etc. instances and intercept methods like close() so that the
Connection isn't actually closed but instead is returned to the pool. Statement and its subclasses
must be wrapped, and SQLException caught and inspected to see if the exception reflects a disconnection
that warrants ejecting the Connection from the pool.
What this means is "delegation". The Connection wrapper cares about intercepting close() or
execute(sql) for example, but for almost all of the other methods of Connection it simply delegates. Something like:

    public Clob createClob() {
        return delegate.createClob();
    }

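
For contrast with a pure delegate method, the sketch below shows what the intercepted methods might look like in a hand-written wrapper. `MiniConnection` and `MiniPool` are hypothetical, cut-down stand-ins invented for this example (the real `java.sql.Connection` is far wider), and the disconnection test is an assumption: it treats any SQLSTATE in class `08`, the SQL standard's "connection exception" class, as fatal.

```java
import java.sql.SQLException;

// Hypothetical stand-ins for Connection and the pool; not HikariCP code.
class DelegationSketch {
    interface MiniConnection {
        String nativeSQL(String sql) throws SQLException;
        void execute(String sql) throws SQLException;
        void close() throws SQLException;
    }

    static class MiniPool {
        int returned;
        int evicted;
    }

    static class ConnectionWrapper implements MiniConnection {
        private final MiniConnection delegate;
        private final MiniPool pool;
        private boolean broken;

        ConnectionWrapper(MiniConnection delegate, MiniPool pool) {
            this.delegate = delegate;
            this.pool = pool;
        }

        // Pure delegation: nothing to intercept.
        @Override public String nativeSQL(String sql) throws SQLException {
            return delegate.nativeSQL(sql);
        }

        // Intercepted: inspect the exception, then rethrow it unchanged.
        @Override public void execute(String sql) throws SQLException {
            try {
                delegate.execute(sql);
            } catch (SQLException e) {
                String state = e.getSQLState();
                if (state != null && state.startsWith("08")) {
                    broken = true;   // assumed rule: class 08 means disconnected
                }
                throw e;
            }
        }

        // Intercepted: the real connection is not closed here.
        @Override public void close() {
            if (broken) { pool.evicted++; } else { pool.returned++; }
        }
    }

    // Stub "driver" connection that can be told to fail.
    static MiniConnection stub(boolean failing) {
        return new MiniConnection() {
            public String nativeSQL(String sql) { return sql; }
            public void execute(String sql) throws SQLException {
                if (failing) throw new SQLException("link down", "08006");
            }
            public void close() { }
        };
    }

    static int[] demo() {
        MiniPool pool = new MiniPool();
        for (boolean failing : new boolean[] { false, true }) {
            ConnectionWrapper c = new ConnectionWrapper(stub(failing), pool);
            try { c.execute("SELECT 1"); } catch (SQLException ignored) { }
            c.close();
        }
        return new int[] { pool.returned, pool.evicted };
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println("returned=" + r[0] + " evicted=" + r[1]);
    }
}
```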
The first iteration of HikariCP also did this, and it still provides a "fallback" mode. An interface like
PreparedStatement contains some 50+ methods, only 4 of which we are interested in intercepting. Rather
than creating a wrapper class that has 50+ "delegate" methods like the above, we use Javassist to generate
all of the delegate methods. While this provides no inherent performance increase, it means that our
"proxy" (wrapper) class only need contain the overridden methods. The Statement proxy class in HikariCP
is only ~130 lines of code including comments, compared to 1100+ lines of code in other pools. This
approach is in keeping with our minimalist ethos.
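
The idea of writing only the overrides and generating the rest can be illustrated with the standard library. The sketch below deliberately swaps in the JDK's `java.lang.reflect.Proxy` for Javassist so that it runs without dependencies; it dispatches reflectively at call time, whereas generated bytecode would not, so it shows the shape of the approach and not its performance. `MiniStatement` and `DriverStatement` are hypothetical stand-ins for a wide interface such as PreparedStatement and a driver's implementation of it.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Illustration only: JDK dynamic proxy used in place of Javassist.
class GeneratedDelegates {
    // Hypothetical stand-in for a wide JDBC interface.
    interface MiniStatement {
        int getFetchSize();
        int getMaxRows();
        void close();
    }

    // Hypothetical "driver" implementation.
    static class DriverStatement implements MiniStatement {
        boolean closed;
        public int getFetchSize() { return 100; }
        public int getMaxRows() { return 0; }
        public void close() { closed = true; }
    }

    static MiniStatement wrap(MiniStatement delegate, List<String> log) {
        InvocationHandler handler = (proxy, method, args) -> {
            // The only hand-written logic: the method we care about.
            if (method.getName().equals("close")) {
                log.add("intercepted close");
            }
            // Every method, intercepted or not, is forwarded without
            // a hand-written delegate per method.
            try {
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                throw e.getCause();   // surface the delegate's own exception
            }
        };
        return (MiniStatement) Proxy.newProxyInstance(
                MiniStatement.class.getClassLoader(),
                new Class<?>[] { MiniStatement.class },
                handler);
    }

    static String demo() {
        List<String> log = new ArrayList<>();
        DriverStatement real = new DriverStatement();
        MiniStatement wrapped = wrap(real, log);
        int fetch = wrapped.getFetchSize();   // plain delegation
        wrapped.close();                      // intercepted, then delegated
        return fetch + " " + log + " closed=" + real.closed;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```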
Our delegates perform quite admirably:
| Pool | Med (ms) | Avg (ms) | Max (ms) |
|---|---|---|---|
| BoneCP | 5049 | 3249 | 6929 |
| HikariCP | 13 | 11 | 58 |
And yet, looking at the bytecode for all of the delegate methods, with their getfield, checkcast, and
invokeinterface opcodes, it really touched a nerve. Is it possible to go faster?
Can we actually eliminate delegation itself?
"I've always been mad, I know I've been mad,
like the most of us,
very hard to explain why you're mad,
even if you're not mad..."
- Pink Floyd
But how? How could we eliminate delegation and still intercept the methods we need? Even more, we need to
wrap every "delegate" method with a try..catch to interrogate SQLExceptions, which is actually interception,
now isn't it?
In order to eliminate delegation the user needs to run against the "bare metal" of their driver classes, yet we still need to intercept methods and wrap them with exception handlers. We were already using Javassist to generate our classes for delegation. Why not use Javassist to inject our code directly into the driver's classes?
However, the classes must be altered before they are loaded ... because convincing the JVM to reload classes is
no trivial task. The answer lay in java.lang.instrument. We built an instrumentation "agent" that
"instruments" the driver classes on the fly as they are loaded, injecting our code into them. The instrumentation
agent is dynamically loaded and unloaded so that it doesn't spend time inspecting classes that have nothing to
do with JDBC and no need for instrumentation.
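
A minimal sketch of what such an agent could look like with `java.lang.instrument` follows. It only filters class names and counts candidates; the actual bytecode rewriting is left out. The driver package prefix and all class names are hypothetical, and the register-then-remove helper is an assumption about how "loaded and unloaded" might be arranged, not HikariCP's actual code.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Sketch of a java.lang.instrument agent; names are hypothetical.
class DriverAgentSketch {
    // Assumed driver package, in the JVM's internal slash-separated form.
    static final String DRIVER_PREFIX = "com/example/driver/";

    static class DriverTransformer implements ClassFileTransformer {
        int candidates;

        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain,
                                byte[] classfileBuffer) {
            if (className == null || !className.startsWith(DRIVER_PREFIX)) {
                return null;   // null tells the JVM to load the class unchanged
            }
            candidates++;
            // A real agent would return rewritten bytes here.
            // This sketch changes nothing.
            return null;
        }
    }

    // Register the transformer only while the driver classes are loading.
    static void instrumentWhile(Instrumentation inst, Runnable loadDriver) {
        DriverTransformer transformer = new DriverTransformer();
        inst.addTransformer(transformer);
        try {
            loadDriver.run();
        } finally {
            inst.removeTransformer(transformer);
        }
    }

    // Entry point when an agent is attached to an already running JVM.
    // A deployable agent also needs an Agent-Class manifest entry.
    public static void agentmain(String agentArgs, Instrumentation inst) {
        instrumentWhile(inst, () -> { /* trigger driver class loading here */ });
    }

    // Exercise the filter directly, without attaching an agent.
    static int demo() {
        DriverTransformer t = new DriverTransformer();
        t.transform(null, "java/lang/String", null, null, new byte[0]);
        t.transform(null, DRIVER_PREFIX + "ConnectionImpl", null, null, new byte[0]);
        return t.candidates;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Removing the transformer once the driver is loaded is what keeps the agent from inspecting every unrelated class for the rest of the JVM's life.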
As slim as our "delegate" proxies are, there is still a fair amount of code, especially in the Connection
proxy. The prospect of "inlining" the bytecode, or worse, source code into the instrumentation code had a bad
smell about it. We've already written the intercept code once in our proxies; can't we just use that somehow?
But the code is in our classes, not in the target driver's classes.
This is where we think code can sometimes become art. We created an annotation @HikariInject, and with it we
annotate all of the fields and methods in the existing proxy classes. The instrumentation agent inspects our
proxy classes, and injects fields or methods tagged with @HikariInject into the target driver class -- with
some special logic for handling collisions. The pure gold is that the exact same class code used in
"delegation" mode is the code that is injected in "instrumentation" mode. There is only one
canonical source for both.
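
The discovery half of this scheme can be sketched with plain reflection. Only the name `@HikariInject` comes from the text above; the annotation's retention and targets, the proxy class, and the scanning method are all assumptions made for illustration. Reflection is used here in place of a bytecode library, and the step that copies members into the driver class is omitted.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: find the members of a proxy class that are tagged for injection.
class InjectScanSketch {
    // Assumed definition; only the annotation's name appears in the text.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ ElementType.FIELD, ElementType.METHOD })
    @interface HikariInject { }

    // Hypothetical proxy class with a mix of tagged and untagged members.
    static class StatementProxySketch {
        @HikariInject private boolean isClosed;
        private Object delegate;            // delegation-only, not injected

        @HikariInject public void close() {
            isClosed = true;
        }

        public Object unwrapDelegate() {    // delegation-only, not injected
            return delegate;
        }
    }

    static List<String> injectableMembers(Class<?> proxyClass) {
        List<String> names = new ArrayList<>();
        for (Field f : proxyClass.getDeclaredFields()) {
            if (f.isAnnotationPresent(HikariInject.class)) {
                names.add("field " + f.getName());
            }
        }
        for (Method m : proxyClass.getDeclaredMethods()) {
            if (m.isAnnotationPresent(HikariInject.class)) {
                names.add("method " + m.getName());
            }
        }
        Collections.sort(names);   // reflection order is unspecified
        return names;
    }

    public static void main(String[] args) {
        System.out.println(injectableMembers(StatementProxySketch.class));
    }
}
```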
The instrumenter is extremely robust, but if there is any kind of failure injecting the code, HikariCP drops back to delegation mode (and logs a message to that effect). The JVM is smart enough to know that if an instrumentation agent throws an exception, the class is loaded cleanly without it -- nothing can be corrupted. Injection takes place at pool startup time, and typically takes only about 200ms.
The result of this is:
| Pool | Med (ms) | Avg (ms) | Max (ms) |
|---|---|---|---|
| BoneCP | 5049 | 3249 | 6929 |
| HikariCP | 8 | 7 | 13 |
While going from 13ms (delegates) to 8ms (instrumentation) may not seem like much, it represents a roughly 38% improvement.
Still, even without instrumentation, how do we get anywhere near 13ms for 60+ million JDBC API invocations? Well, we're obviously running against a stub implementation, so the JIT is doing a lot of inlining. However, the same inlining at the stub-level is occurring for BoneCP in the benchmark. So, no inherent advantage to us.
But inlining is part of the equation, and we will say that BoneCP has at least 10 methods that are flagged as "hot" by the JVM but that the JIT considers too large to inline, and at least two of these are on the critical path. HikariCP has none. Which brings us to another topic...
Some light reading. TL;DR: Obviously, when you're running 400 threads "at once", you aren't really running them "at once" unless you have 400 cores. The operating system, using N cores, switches between your threads, giving each a small "slice" of time to run, called a quantum (plural: quanta).
But with 400 threads, when your time runs out (as a thread) it may be a "long time" before the scheduler gives you a chance to run again. With this many threads, if a thread cannot complete what it needs to get done during its time-slice, well, there is a performance penalty to be paid. And not a small one.
We have combed through HikariCP, crushing and optimizing the critical code paths to ensure they can fully execute any operation within a single quantum, with the exception, of course, of a truly blocked condition such as no available connections. Actually, not just any operation -- 60+ million of them.
The fact is, with JIT inlining and execution path optimizations, a thread invoking against HikariCP can get through all 60+ million JDBC operations in the MixedBench benchmark within a single scheduler quantum.
Put that in your pipe and smoke it!
We don't know. Our original goal when moving from delegates to instrumentation was to reach sub-millisecond times for 60+ million JDBC API invocations. We continue to poke at the problem, and we're not sure where the theoretical maximum actually is, but we feel we're getting close. Maybe we're at "good enough" and it's time to take on another task.