
Adaptive, specializing interpreter #28

@markshannon

This explains the adaptive, specializing interpreter mentioned as the core plan of stage 1 in https://github.com/markshannon/faster-cpython/blob/master/plan.md
This requires and builds on the quickening interpreter #27.

Overview

We exploit the capability of the quickening interpreter to support self-modifying code to add specialization "families" of bytecodes for those bytecodes which would gain most from specialization. The bytecodes within a "family" can replace themselves with other members of the family at runtime. This allows us to perform quite aggressive speculative specialization, since de-optimization is cheap.

For each "plain" bytecode that would benefit from specialization we add an "adaptive" bytecode and several specialized forms.
During the quickening phase the "plain" bytecodes are replaced with the "adaptive" bytecode. At runtime the "adaptive" bytecode maintains a counter and when that counter reaches zero it attempts to specialize itself. Otherwise it jumps to the "plain" bytecode to perform the operation. Should specialization fail, the counter is set to some appropriate number, say 20.

The adaptive forms are coded roughly as:

if counter == 0:
    nexti -= 1 # Dispatch will jump to specialized version
    specialize(instructions[nexti])
else:
    counter -= 1
    goto PLAIN_BYTECODE

The specialized forms are coded roughly as:

if predicate(obj):
    saturating_increment(counter)
    do_specialized_operation()
else:
    saturating_decrement(counter)
    if counter == 0:
        replace_self_with_adaptive_form()
    goto PLAIN_BYTECODE

Equivalence to specializing, optimizing VM where the region of optimization is one bytecode

Specializing, optimizing VMs work by selecting regions of code to optimize and compile.
V8 optimizes whole functions. PyPy and LuaJIT optimize loopy traces. HHVM optimizes tracelets, then multiple tracelets stitched together.

By specializing just one bytecode we do not need JIT compilation. We can compile the specializations ahead of time by implementing them in C. Obviously this will not perform nearly as well as optimizing larger regions, but we can still achieve worthwhile speedups.

Example

Using LOAD_GLOBAL as an example, we add three new bytecodes: LOAD_GLOBAL_ADAPTIVE, LOAD_GLOBAL_MODULE, LOAD_GLOBAL_BUILTIN.

LOAD_GLOBAL_ADAPTIVE is implemented as above.

LOAD_GLOBAL_MODULE is a specialization for when the value is stored in the module's dictionary.

    if (UNLIKELY(globals->version != EXPECTED_VERSION)) {
         goto maybe_deopt_load_global_module; // Handles counter, etc.
    }
    saturating_increment(counter);
    INCREF(cached_value);
    PUSH(cached_value);

LOAD_GLOBAL_BUILTIN is a specialization for when the value is stored in the builtins dictionary.

    if (UNLIKELY(globals->version != EXPECTED_VERSION1)) {
         goto maybe_deopt_load_global_builtin; // Handles counter, etc.
    }
    if (UNLIKELY(builtins->version != EXPECTED_VERSION2)) {
        goto maybe_deopt_load_global_builtin;
    }
    saturating_increment(counter);
    INCREF(cached_value);
    PUSH(cached_value);
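
The maybe_deopt labels referenced above are not spelled out in this issue; they would follow the generic de-optimization pattern from the specialized-form pseudocode, roughly like this (a sketch; the label and helper names are illustrative):

    maybe_deopt_load_global_builtin:
        // Same pattern as the generic specialized form above.
        saturating_decrement(counter);
        if (counter == 0) {
            // Too many misses: revert to the adaptive form so it can respecialize.
            replace_self_with_adaptive_form();
        }
        goto PLAIN_BYTECODE; // i.e. the plain LOAD_GLOBAL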

Specialization of LOAD_GLOBAL_ADAPTIVE determines whether the value is in the module's dictionary or the builtins dictionary, records the expected version number(s) and the expected value, then converts the bytecode to LOAD_GLOBAL_MODULE or LOAD_GLOBAL_BUILTIN.
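
A rough sketch of what that specialization routine might look like in C follows. The function name, the Instruction and cache types, the field names, and the DICT_VERSION macro are all placeholders for illustration, not the actual implementation:

    // Hypothetical specializer invoked from LOAD_GLOBAL_ADAPTIVE.
    static void
    specialize_load_global(Instruction *instr, PyObject *globals, PyObject *builtins,
                           PyObject *name, LoadGlobalCache *cache)
    {
        PyObject *value = PyDict_GetItemWithError(globals, name);
        if (value != NULL) {
            // Found in the module's dictionary: guard on its version.
            cache->module_version = DICT_VERSION(globals);
            cache->cached_value = value;
            instr->opcode = LOAD_GLOBAL_MODULE;
            return;
        }
        value = PyDict_GetItemWithError(builtins, name);
        if (value != NULL) {
            // Found in the builtins dictionary: guard on both versions.
            cache->module_version = DICT_VERSION(globals);
            cache->builtins_version = DICT_VERSION(builtins);
            cache->cached_value = value;
            instr->opcode = LOAD_GLOBAL_BUILTIN;
            return;
        }
        // Specialization failed: back off before trying again (see "say 20" above).
        cache->counter = 20;
    }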

Note:

The above uses dictionary versions. It is possible to use the dictionary keys to allow a specialization that works when a global variable is present, but that is out of scope for this issue.

Implementation

Alongside the array of quickened instructions we need a supporting data structure holding counts, expected values and versions,
original operands, and cached values.
During quickening, memory for the specialized instructions must be allocated in such a way that it can be accessed quickly during execution.
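
As a rough illustration (field names are assumptions, not the actual layout), the entries could be built out of 8-byte items like the DataItem mentioned under "Preferred implementation" below:

    #include <stdint.h>

    // Hypothetical 8-byte item; an entry for a bytecode is one or more of these.
    typedef union {
        struct {
            uint16_t counter;        // saturating counter for (de)optimization
            uint16_t original_oparg; // oparg of the original "plain" bytecode
            uint32_t version;        // expected dictionary (or type) version tag
        } guard;
        PyObject *cached_value;      // cached pointer, e.g. a global's value
    } DataItem;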

Possible implementations

Entries:

  1. Fixed size entries, all bytecodes use the same sized data structure. This is inflexible but simple.
  2. Variable sized entries, flexible but more complex. Note that entries can vary in size across families, not within them. The size of the data structure is fixed for a given family of bytecodes (see the sketch below).
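
As an illustration of option 2, each family could declare its own fixed entry as a small number of the 8-byte data items sketched above (the layouts and the second family here are purely illustrative):

    // Entry sizes are fixed per family, but differ between families.
    typedef struct {
        DataItem guard;          // counter + expected module dict version
        DataItem builtins_guard; // expected builtins dict version
        DataItem value;          // cached value
    } LoadGlobalEntry;           // 3 items for the LOAD_GLOBAL family

    typedef struct {
        DataItem guard;          // counter + expected type version
        DataItem value;          // cached attribute
    } LoadAttrEntry;             // 2 items for a hypothetical LOAD_ATTR family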

Indexing:

  1. Index by external array. Maps instruction offset to offset in data array. Slow as it requires extra indirection.
  2. Index stored as the oparg. Fast, but limits the number of entries to 256, or fewer if variable sized entries are used.
  3. Index as a function of instruction offset and oparg. A bit slower than 2, but allows a practically unlimited number of entries (see the sketch below).
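
Roughly, the three options correspond to lookups along the following lines (a sketch; data, offset_table, nexti and oparg are assumed names, and option 3 is the formula used under "Preferred implementation" below):

    // 1. External index array: an extra load maps instruction offset to data offset.
    DataItem *entry1 = &data[offset_table[nexti]];

    // 2. Index stored directly as the oparg: fast, but at most 256 entries.
    DataItem *entry2 = &data[oparg];

    // 3. Index computed from the instruction offset and the oparg:
    //    a little extra arithmetic, but an effectively unlimited number of entries.
    DataItem *entry3 = &data[(nexti >> 1) + oparg];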

Layout:

  1. Two arrays. Requires two registers for indexing implementation 2, but three registers for indexing implementation 3.
  2. Back-to-back arrays. Instructions go forwards in memory from a shared pointer, data goes backwards. Requires two registers for indexing implementation 3 (sketched below).
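
A sketch of how the back-to-back arrays might be allocated, under the sizing assumption used below (names such as Instruction and n_instrs are illustrative):

    // Hypothetical allocation for a code object with n_instrs quickened
    // instructions, assuming ~0.5 data items per instruction on average.
    size_t n_items = n_instrs / 2;
    char *block = PyMem_Malloc(n_items * sizeof(DataItem)
                               + n_instrs * sizeof(Instruction));
    // Instructions start just above the data items and grow forwards in memory;
    Instruction *instruction_base =
        (Instruction *)(block + n_items * sizeof(DataItem));
    // data items sit below instruction_base and are reached by negative indexing,
    // as in the formula under "Preferred implementation".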

Preferred implementation

The best (fastest in general, though not the simplest) approach is to allow variable sized entries, indexed by a function of offset and oparg, using back-to-back arrays.
To get the data for a bytecode would be something like:

DataItem *data = &((DataItem *)instruction_base)[-(nexti>>1)-oparg]

where nexti is the index of the next instruction, and assuming that, on average, each instruction needs half a data item (approximately 25% of instructions need data, and those that do need 2 items on average).

Indexing 3 should be faster than using indexing 2 when combined with back-to-back arrays and variable sized entries, because it reduces register use, and is more compact. The additional ALU operation(s) required shouldn't matter too much.

On a 64-bit machine, a DataItem would be 8 bytes.
