This describes the adaptive, specializing interpreter that forms the core of stage 1 in https://github.com/markshannon/faster-cpython/blob/master/plan.md
It requires and builds on the quickening interpreter #27.
Overview
We exploit the quickening interpreter's support for self-modifying code to add specialization "families" of bytecodes for those bytecodes that would gain the most from specialization. The bytecodes within a family can replace themselves with other members of the family at runtime. This allows quite aggressive speculative specialization, since de-optimization is cheap.
For each "plain" bytecode that would benefit from specialization we add an "adaptive" bytecode and several specialized forms.
During the quickening phase the "plain" bytecodes are replaced with the "adaptive" bytecode. At runtime the "adaptive" bytecode maintains a counter; when that counter reaches zero it attempts to specialize itself, otherwise it jumps to the "plain" bytecode to perform the operation. Should specialization fail, the counter is reset to some appropriate number, say 20.
The adaptive forms are coded roughly as:
if counter == 0:
    nexti -= 1 # Dispatch will jump to specialized version
    specialize(instructions[nexti])
else:
    counter -= 1
    goto PLAIN_BYTECODE
The specialized forms are coded roughly as:
if predicate(obj):
    saturating_increment(counter)
    do_specialized_operation()
else:
    saturating_decrement(counter)
    if counter == 0:
        replace_self_with_adaptive_form()
    goto PLAIN_BYTECODE
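To make the lifecycle concrete, here is a small, self-contained C sketch that simulates the cycle above for a single instruction: quickening installs the adaptive form, the countdown triggers specialization, the specialized form runs while its guard holds, and repeated guard failures cheaply de-optimize back to the adaptive form. All names, constants and the toy "version" guard are illustrative; this is not CPython code.

#include <stdio.h>
#include <stdint.h>

enum { PLAIN, ADAPTIVE, SPECIALIZED };
static const char *form_name[] = { "PLAIN", "ADAPTIVE", "SPECIALIZED" };

typedef struct {
    int     form;     /* which member of the family is currently installed */
    uint8_t counter;  /* countdown (adaptive) or confidence (specialized)  */
    int     expected; /* the "version" the specialized form assumes        */
} Instr;

static void saturating_increment(uint8_t *c) { if (*c < 255) (*c)++; }
static void saturating_decrement(uint8_t *c) { if (*c > 0) (*c)--; }

/* Execute one instruction against the current environment "version". */
static void execute(Instr *i, int env_version)
{
    switch (i->form) {
    case ADAPTIVE:
        if (i->counter == 0) {
            i->form = SPECIALIZED;   /* specialize for what we see right now */
            i->expected = env_version;
            i->counter = 4;          /* initial confidence (illustrative)    */
        } else {
            i->counter--;
        }
        break;                       /* perform the operation the plain way  */
    case SPECIALIZED:
        if (env_version == i->expected) {
            saturating_increment(&i->counter);   /* guard held: fast path    */
            return;
        }
        saturating_decrement(&i->counter);       /* guard failed: cheap de-opt */
        if (i->counter == 0) {
            i->form = ADAPTIVE;
            i->counter = 10;         /* back off before trying again         */
        }
        break;                       /* fall back to the plain operation     */
    default:
        break;                       /* PLAIN: always-correct slow path      */
    }
}

int main(void)
{
    Instr i = { ADAPTIVE, 4, 0 };    /* quickening installed the adaptive form */
    for (int step = 0; step < 40; step++) {
        int env_version = (step < 12) ? 1 : 2;   /* the "global" changes at step 12 */
        execute(&i, env_version);
        printf("step %2d: form=%-11s counter=%u\n",
               step, form_name[i.form], (unsigned)i.counter);
    }
    return 0;
}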
Equivalence to specializing, optimizing VM where the region of optimization is one bytecode
Specializing, optimizing VMs work by selecting regions of code to optimize and compile.
V8 optimizes whole functions. PyPy and LuaJIT optimize loopy traces. HHVM optimizes tracelets, then multiple tracelets stitched together.
By specializing just one bytecode we avoid the need for JIT compilation: the specializations can be compiled ahead of time by implementing them in C. Obviously this will not perform as well as optimizing larger regions, but we can still achieve worthwhile speedups.
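As a rough sketch only (the opcode names are those of the LOAD_GLOBAL example below), each family member is simply another ahead-of-time compiled case in the interpreter's dispatch switch; specializing or de-optimizing an instruction is just rewriting its opcode byte, with no machine code generated at runtime.

/* Illustrative structure only; no runtime code generation is involved. */
enum {
    LOAD_GLOBAL,            /* plain: always correct, makes no assumptions  */
    LOAD_GLOBAL_ADAPTIVE,   /* counts executions, then tries to specialize  */
    LOAD_GLOBAL_MODULE,     /* speculative: value cached from the globals   */
    LOAD_GLOBAL_BUILTIN     /* speculative: value cached from the builtins  */
};

static void dispatch(unsigned char *opcode)
{
    switch (*opcode) {
    case LOAD_GLOBAL:          /* ... plain implementation ... */         break;
    case LOAD_GLOBAL_ADAPTIVE: /* ... adaptive form as shown above ... */ break;
    case LOAD_GLOBAL_MODULE:   /* ... guarded fast path ... */            break;
    case LOAD_GLOBAL_BUILTIN:  /* ... guarded fast path ... */            break;
    }
}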
Example
Using LOAD_GLOBAL as an example, we add three new bytecodes: LOAD_GLOBAL_ADAPTIVE, LOAD_GLOBAL_MODULE and LOAD_GLOBAL_BUILTIN.
LOAD_GLOBAL_ADAPTIVE is implemented as above.
LOAD_GLOBAL_MODULE is a specialization for when the value is stored in the module's dictionary.
if (UNLIKELY(globals->version != EXPECTED_VERSION)) {
    goto maybe_deopt_load_global_module; // Handles counter, etc.
}
saturating_increment(counter);
INCREF(cached_value);
PUSH(cached_value);
LOAD_GLOBAL_BUILTIN is a specialization for when the value is stored in the builtins dictionary.
if (UNLIKELY(globals->version != EXPECTED_VERSION1)) {
    goto maybe_deopt_load_global_builtin; // Handles counter, etc.
}
if (UNLIKELY(builtins->version != EXPECTED_VERSION2)) {
    goto maybe_deopt_load_global_builtin;
}
saturating_increment(counter);
INCREF(cached_value);
PUSH(cached_value);
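The maybe_deopt_* targets are not spelled out above; a minimal sketch of what such a de-optimization path might do, written as a function rather than a goto target for readability (the helper names, opcode value and exact policy are assumptions):

#include <stdint.h>

enum { LOAD_GLOBAL_ADAPTIVE = 200 };   /* illustrative opcode value */

static void saturating_decrement(uint8_t *c) { if (*c > 0) (*c)--; }

/* Called when a guard in LOAD_GLOBAL_MODULE / LOAD_GLOBAL_BUILTIN fails.
 * After enough failures the instruction reverts to its adaptive form;
 * either way, execution then falls back to the plain LOAD_GLOBAL code,
 * which performs the operation correctly without any assumptions. */
static void maybe_deopt_load_global(uint8_t *opcode, uint8_t *counter)
{
    saturating_decrement(counter);
    if (*counter == 0) {
        *opcode = LOAD_GLOBAL_ADAPTIVE;  /* may try to re-specialize later */
    }
    /* ... then: goto LOAD_GLOBAL (the plain bytecode). */
}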
Specialization of LOAD_GLOBAL_ADAPTIVE determines whether the value is in the module's dictionary or the builtins dictionary, records the expected version number(s) and the expected value, then converts the bytecode to LOAD_GLOBAL_MODULE or LOAD_GLOBAL_BUILTIN.
Note:
The above uses dictionary versions. It is possible to use the dictionary keys to allow a specialization that works when a global variable is present, but that is out of scope for this issue.
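A sketch of that specialization step, using hypothetical types and helpers (Dict, dict_lookup, the cache fields) rather than CPython's real internals:

#include <stdint.h>

enum { LOAD_GLOBAL_MODULE = 201, LOAD_GLOBAL_BUILTIN = 202 };  /* illustrative */

typedef struct { uint64_t version; /* ... keys, values ... */ } Dict;
typedef struct {
    uint8_t   opcode;             /* rewritten in place on success      */
    uint64_t  expected_version;   /* globals version                    */
    uint64_t  expected_version2;  /* builtins version (builtin case)    */
    void     *cached_value;
} LoadGlobalCache;

extern void *dict_lookup(Dict *dict, const char *name);  /* hypothetical */

/* Returns 1 on success; on failure the caller resets the adaptive
 * counter (e.g. to 20) and leaves the instruction unchanged. */
static int specialize_load_global(LoadGlobalCache *cache, Dict *globals,
                                  Dict *builtins, const char *name)
{
    void *value = dict_lookup(globals, name);
    if (value != NULL) {
        cache->expected_version = globals->version;
        cache->cached_value = value;
        cache->opcode = LOAD_GLOBAL_MODULE;
        return 1;
    }
    value = dict_lookup(builtins, name);
    if (value != NULL) {
        /* The guard must also check that the name has not since been
         * added to the module's dictionary, hence two versions. */
        cache->expected_version = globals->version;
        cache->expected_version2 = builtins->version;
        cache->cached_value = value;
        cache->opcode = LOAD_GLOBAL_BUILTIN;
        return 1;
    }
    return 0;
}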
Implementation
Alongside the array of quickened instructions we need a supporting data structure holding counts, expected values and versions,
original operands, and cached values.
During quickening, memory for the specialized instructions must be allocated in such a way that it can be accessed quickly during execution.
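For illustration only, one such entry might hold something like the following; field names and widths are assumptions, and this does not commit to any of the layout options discussed below.

#include <stdint.h>

/* Hypothetical shape of one entry; real layouts would differ per family. */
typedef struct {
    uint8_t   counter;          /* adaptive countdown / specialized confidence */
    uint8_t   original_oparg;   /* so the instruction can be de-quickened      */
    uint64_t  expected_version; /* e.g. a dictionary version to guard on       */
    void     *cached_value;     /* e.g. the object expected to be loaded       */
} SpecializationData;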
Possible implementations
Entries:
- Fixed size entries, all bytecodes use the same sized data structure. This is inflexible but simple.
- Variable sized entries, flexible but more complex. Entries can vary in size across families, but not within them: the size of the data structure is fixed for a given family of bytecodes.
Indexing:
1. Index by external array. Maps instruction offset to an offset in the data array. Slow, as it requires an extra indirection.
2. Index stored as the oparg. Fast, but limits the number of entries to 256, or fewer if variable sized entries are used.
3. Index as a function of instruction offset and oparg. A bit slower than 2, but allows a near unlimited number of entries.
Layout:
- Two arrays. Requires two registers for indexing implementation 2, but three registers for indexing implementation 3.
- Back-to-back arrays. Instructions go forwards in memory from a shared pointer, data goes backwards. Requires two registers for indexing implementation 3.
Preferred implementation
The best approach (fastest in general, though not the simplest) is to allow variable sized entries, index them by a function of offset and oparg, and use back-to-back arrays.
Getting the data for a bytecode would look something like:

DataItem *data = &((DataItem *)instruction_base)[-(nexti>>1)-oparg]

where nexti is the index of the next instruction, assuming that, on average, each instruction needs half a data item (approximately 25% of instructions need data, and those that do need two items on average).
Indexing 3 should be faster than using indexing 2 when combined with back-to-back arrays and variable sized entries, because it reduces register use, and is more compact. The additional ALU operation(s) required shouldn't matter too much.
On a 64-bit machine, a DataItem would be 8 bytes.
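A minimal sketch of how the back-to-back layout and the indexing expression above could fit together; the single-malloc allocation strategy, names and types are illustrative assumptions, not a definitive implementation.

#include <stdint.h>
#include <stdlib.h>

typedef union { uint64_t bits; void *ptr; } DataItem;   /* 8 bytes on a 64-bit machine */

typedef struct {
    DataItem *block;              /* start of the single allocation              */
    uint8_t  *instruction_base;   /* instructions grow forwards from this point  */
} QuickenedCode;

/* One allocation: data items sit below instruction_base, the quickened
 * instructions sit above it, so both are reached from the same pointer. */
static int allocate_quickened(QuickenedCode *q, size_t n_data_items,
                              size_t n_instruction_bytes)
{
    DataItem *block = malloc(n_data_items * sizeof(DataItem) + n_instruction_bytes);
    if (block == NULL)
        return -1;
    q->block = block;
    q->instruction_base = (uint8_t *)(block + n_data_items);
    return 0;
}

/* The lookup from the text: the data for the instruction at index nexti,
 * with operand oparg, lives at a negative offset from instruction_base. */
static DataItem *get_data(QuickenedCode *q, int nexti, int oparg)
{
    return &((DataItem *)q->instruction_base)[-(nexti >> 1) - oparg];
}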