
Adaptive, specializing interpreter #28

@markshannon

This explains the adaptive, specializing interpreter mentioned as the core plan of stage 1 in https://github.com/markshannon/faster-cpython/blob/master/plan.md
This requires and builds on the quickening interpreter #27.

Overview

We exploit the capability of the quickening interpreter to support self-modifying code to add specialization "families" of bytecodes for those bytecodes which would gain most from specialization. The bytecodes within a "family" can replace themselves with other members of the family at runtime. This allows us to perform quite aggressive speculative specialization, since de-optimization is cheap.

For each "plain" bytecode that would benefit from specialization we add an "adaptive" bytecode and several specialized forms.
During the quickening phase the "plain" bytecodes are replaced with the "adaptive" bytecode. At runtime the "adaptive" bytecode maintains a counter and when that counter reaches zero it attempts to specialize itself. Otherwise it jumps to the "plain" bytecode to perform the operation. Should specialization fail, the counter is set to some appropriate number, say 20.

The adaptive forms are coded roughly as:

if counter == 0:
    nexti -= 1 # Dispatch will jump to specialized version
    specialize(instructions[nexti])
else:
    counter -= 1
    goto PLAIN_BYTECODE

The specialized forms are coded roughly as:

if predicate(obj):
    saturating_increment(counter)
    do_specialized_operation()
else:
    saturating_decrement(counter)
    if counter == 0:
        replace_self_with_adaptive_form()
    goto PLAIN_BYTECODE

Equivalence to specializing, optimizing VM where the region of optimization is one bytecode

Specializing, optimizing VMs work by selecting regions of code to optimize and compile.
V8 optimizes whole functions. PyPy and LuaJIT optimize loopy traces. HHVM optimizes tracelets, then multiple tracelets stitched together.

By specializing just one bytecode we do not need JIT compilation. We can compile the specializations ahead of time by implementing them in C. Obviously this will not perform nearly as well as optimizing larger regions, but we can still achieve worthwhile speedups.

Example

Using LOAD_GLOBAL as an example, we add three new bytecodes: LOAD_GLOBAL_ADAPTIVE, LOAD_GLOBAL_MODULE, LOAD_GLOBAL_BUILTIN.

LOAD_GLOBAL_ADAPTIVE is implemented as above.

LOAD_GLOBAL_MODULE is a specialization for when the value is stored in the module's dictionary.

    if (UNLIKELY(globals->version != EXPECTED_VERSION)) {
         goto maybe_deopt_load_global_module; // Handles counter, etc.
    }
    saturating_increment(counter);
    INCREF(cached_value);
    PUSH(cached_value);

LOAD_GLOBAL_BUILTIN is a specialization for when the value is stored in the builtins dictionary.

    if (UNLIKELY(globals->version != EXPECTED_VERSION1)) {
         goto maybe_deopt_load_global_builtin; // Handles counter, etc.
    }
    if (UNLIKELY(builtins->version != EXPECTED_VERSION2)) {
        goto maybe_deopt_load_global_builtin;
    }
    saturating_increment(counter);
    INCREF(cached_value);
    PUSH(cached_value);
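
The maybe_deopt labels referenced above are not spelled out in this issue; they would follow the generic de-optimization pattern from the specialized-form pseudocode, roughly like this (a sketch; the label and helper names are illustrative):

    maybe_deopt_load_global_builtin:
        // Same pattern as the generic specialized form above.
        saturating_decrement(counter);
        if (counter == 0) {
            // Too many misses: revert to the adaptive form so it can respecialize.
            replace_self_with_adaptive_form();
        }
        goto PLAIN_BYTECODE; // i.e. the plain LOAD_GLOBAL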

Specialization of LOAD_GLOBAL_ADAPTIVE determines whether the value is in the module's dictionary or the builtins dictionary, records the expected version number(s) and the expected value, then converts the bytecode to LOAD_GLOBAL_MODULE or LOAD_GLOBAL_BUILTIN.
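
A rough sketch of what that specialization routine might look like in C follows. The function name, the Instruction and cache types, the field names, and the DICT_VERSION macro are all placeholders for illustration, not the actual implementation:

    // Hypothetical specializer invoked from LOAD_GLOBAL_ADAPTIVE.
    static void
    specialize_load_global(Instruction *instr, PyObject *globals, PyObject *builtins,
                           PyObject *name, LoadGlobalCache *cache)
    {
        PyObject *value = PyDict_GetItemWithError(globals, name);
        if (value != NULL) {
            // Found in the module's dictionary: guard on its version.
            cache->module_version = DICT_VERSION(globals);
            cache->cached_value = value;
            instr->opcode = LOAD_GLOBAL_MODULE;
            return;
        }
        value = PyDict_GetItemWithError(builtins, name);
        if (value != NULL) {
            // Found in the builtins dictionary: guard on both versions.
            cache->module_version = DICT_VERSION(globals);
            cache->builtins_version = DICT_VERSION(builtins);
            cache->cached_value = value;
            instr->opcode = LOAD_GLOBAL_BUILTIN;
            return;
        }
        // Specialization failed: back off before trying again (see "say 20" above).
        cache->counter = 20;
    }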

Note:

The above uses dictionary versions. It is possible to use the dictionary keys to allow a specialization that works when a global variable is present, but that is out of scope for this issue.

Implementation

Alongside the array of quickened instructions we need a supporting data structure holding counts, expected values and versions,
original operands, and cached values.
During quickening, memory for the specialized instructions must be allocated in such a way that it can be accessed quickly during execution.
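
As a rough illustration (field names are assumptions, not the actual layout), the entries could be built out of 8-byte items like the DataItem mentioned under "Preferred implementation" below:

    #include <stdint.h>

    // Hypothetical 8-byte item; an entry for a bytecode is one or more of these.
    typedef union {
        struct {
            uint16_t counter;        // saturating counter for (de)optimization
            uint16_t original_oparg; // oparg of the original "plain" bytecode
            uint32_t version;        // expected dictionary (or type) version tag
        } guard;
        PyObject *cached_value;      // cached pointer, e.g. a global's value
    } DataItem;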

Possible implementations

Entries:

  1. Fixed size entries, all bytecodes use the same sized data structure. This is inflexible but simple.
  2. Variable sized entries, flexible but more complex. Note that entries can vary in size across families, not within them. The size of the data structure is fixed for a given family of bytecodes (see the sketch below).
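
As an illustration of option 2, each family could declare its own fixed entry as a small number of the 8-byte data items sketched above (the layouts and the second family here are purely illustrative):

    // Entry sizes are fixed per family, but differ between families.
    typedef struct {
        DataItem guard;          // counter + expected module dict version
        DataItem builtins_guard; // expected builtins dict version
        DataItem value;          // cached value
    } LoadGlobalEntry;           // 3 items for the LOAD_GLOBAL family

    typedef struct {
        DataItem guard;          // counter + expected type version
        DataItem value;          // cached attribute
    } LoadAttrEntry;             // 2 items for a hypothetical LOAD_ATTR family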

Indexing:

  1. Index by external array. Maps instruction offset to offset in data array. Slow as it requires extra indirection.
  2. Index stored as the oparg. Fast, but limits the number of entries to 256, or fewer if variable sized entries are used.
  3. Index as a function of instruction offset and oparg. A bit slower than 2, but allows a practically unlimited number of entries (see the sketch below).
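
Roughly, the three options correspond to lookups along the following lines (a sketch; data, offset_table, nexti and oparg are assumed names, and option 3 is the formula used under "Preferred implementation" below):

    // 1. External index array: an extra load maps instruction offset to data offset.
    DataItem *entry1 = &data[offset_table[nexti]];

    // 2. Index stored directly as the oparg: fast, but at most 256 entries.
    DataItem *entry2 = &data[oparg];

    // 3. Index computed from the instruction offset and the oparg:
    //    a little extra arithmetic, but an effectively unlimited number of entries.
    DataItem *entry3 = &data[(nexti >> 1) + oparg];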

Layout:

  1. Two arrays. Requires two registers for indexing implementation 2, but three registers for indexing implementation 3.
  2. Back-to-back arrays. Instructions go forwards in memory from a shared pointer, data goes backwards. Requires two registers for indexing implementation 3 (sketched below).
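
A sketch of how the back-to-back arrays might be allocated, under the sizing assumption used below (names such as Instruction and n_instrs are illustrative):

    // Hypothetical allocation for a code object with n_instrs quickened
    // instructions, assuming ~0.5 data items per instruction on average.
    size_t n_items = n_instrs / 2;
    char *block = PyMem_Malloc(n_items * sizeof(DataItem)
                               + n_instrs * sizeof(Instruction));
    // Instructions start just above the data items and grow forwards in memory;
    Instruction *instruction_base =
        (Instruction *)(block + n_items * sizeof(DataItem));
    // data items sit below instruction_base and are reached by negative indexing,
    // as in the formula under "Preferred implementation".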

Preferred implementation

The best (fastest in general, though not the simplest) approach is to allow variable sized entries, indexed by a function of offset and oparg, using back-to-back arrays.
To get the data for a bytecode would be something like:

DataItem *data = &((DataItem *)instruction_base)[-(nexti>>1)-oparg]

where nexti is the index of the next instruction, and assuming that, on average, each instruction needs half a data item (approximately 25% of instructions need data, and those that do need 2 items on average).

Indexing 3 should be faster than using indexing 2 when combined with back-to-back arrays and variable sized entries, because it reduces register use, and is more compact. The additional ALU operation(s) required shouldn't matter too much.

On a 64-bit machine, a DataItem would be 8 bytes.
