
Micro-op (tier 2) interpreter #580

Open
markshannon opened this issue May 16, 2023 · 6 comments
Labels
epic-tier2-optimizer Linear code region optimizer for 3.13 and beyond.

Comments

@markshannon
Member

The output of the tier 2 optimization pipeline is a superblock of micro-ops.
We will need an interpreter for these micro-ops for a few reasons:

  • To execute the micro-ops if no JIT compiler is present.
  • To serve as a specification for the JIT
  • To help debug or verify the JIT

Instructions

To execute the micro-ops they will need to be represented in memory in an efficient format.
We can reject rare or complex bytecodes when optimizing, so we can ignore many corner cases.

Each instruction should consist of an opcode and oparg, as they do in normal bytecode, but the oparg can be either the original oparg or a cache entry, so we will need more than 8 bits.
A 32-bit instruction with an 8-bit opcode, a 22-bit oparg, and 2 spare bits should be a good starting point. We might want a 9- or 10-bit opcode if we are to have a large number of superinstructions, or we might need 24-bit operands. We'll see.
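The proposed layout can be sketched as a pair of packing helpers. This is a minimal illustration of the 8/22/2 split described above, not CPython code; the function names are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the proposed 32-bit micro-op word:
 * bits 0-7: opcode, bits 8-29: oparg, bits 30-31: spare. */
static inline uint32_t
uop_pack(uint8_t opcode, uint32_t oparg)
{
    assert(oparg < (1u << 22));   /* oparg must fit in 22 bits */
    return (uint32_t)opcode | (oparg << 8);
}

static inline uint8_t
uop_opcode(uint32_t word)
{
    return (uint8_t)(word & 0xff);
}

static inline uint32_t
uop_oparg(uint32_t word)
{
    return (word >> 8) & 0x3fffff;
}
```

The 22-bit limit on oparg is what forces large cache entries (like the 64-bit descr discussed below) out of the instruction word.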

The interpreter

The interpreter should be generated from bytecodes.c, much as we do for _PyEval_EvalFrameDefault().

Changes to bytecodes.c

In order to create the interpreter each micro-op can take at most one operand: either the original oparg or one cache entry.
To do this we will need to break down some instructions into quite small parts.

E.g. LOAD_ATTR_METHOD_WITH_VALUES has the signature:
inst(LOAD_ATTR_METHOD_WITH_VALUES, (unused/1, type_version/2, keys_version/2, descr/4, self -- res2 if (oparg & 1), res))

We can ignore the oparg, as the optimizer will eliminate it. That leaves the type_version, keys_version and descr operands, so this instruction will need to be split into at least three parts, something like:

macro(LOAD_ATTR_METHOD_WITH_VALUES) = 
    SKIP_COUNTER + CHECK_TYPE_VERSION + 
    CHECK_SHARED_DICT_KEYS_VERSION + LOAD_INLINED_ATTRIBUTE;

Where the parts would be defined as something like:

op(SKIP_COUNTER, (unused/1 -- )) {}

op(CHECK_TYPE_VERSION, (type_version/2, self -- self))
{
    PyTypeObject *self_cls = Py_TYPE(self);
    assert(type_version != 0);
    DEOPT_IF(self_cls->tp_version_tag != type_version);
    assert(self_cls->tp_flags & Py_TPFLAGS_MANAGED_DICT);
}

op(CHECK_SHARED_DICT_KEYS_VERSION, (keys_version/2, self -- self))
{
    PyTypeObject *self_cls = Py_TYPE(self);
    PyDictOrValues dorv = *_PyObject_DictOrValuesPointer(self);
    DEOPT_IF(!_PyDictOrValues_IsValues(dorv), LOAD_ATTR);
    PyHeapTypeObject *self_heap_type = (PyHeapTypeObject *)self_cls;
    DEOPT_IF(self_heap_type->ht_cached_keys->dk_version != keys_version);
    STAT_INC(opcode, hit);
}

op(LOAD_INLINED_ATTRIBUTE, (descr/4, self -- res, self))
{
    assert(descr != NULL);
    res = Py_NewRef(descr);
    assert(_PyType_HasFeature(Py_TYPE(res), Py_TPFLAGS_METHOD_DESCRIPTOR));
}

Fitting the 64-bit descr field into 22 bits is difficult, if not impossible.
However, we need to shrink this entry anyway in order to reduce the size of code objects.
So we'll need to do that first, or make the instructions 64 bits each.

@markshannon markshannon added the epic-tier2-optimizer Linear code region optimizer for 3.13 and beyond. label May 16, 2023
@iritkatriel
Collaborator

Why can't the micro-ops have caches / variable-length inputs too?

@markshannon
Member Author

It is not an absolute requirement, but having only a single operand per instruction has a number of advantages.

  • It makes the optimizer simpler if each micro-op has one action. That way most, if not all, optimizations can remove or replace one micro-op at a time.
  • It speeds up the optimizer if each instruction is a fixed width.
  • Backward passes over the IR are tricky if the instructions aren't of a fixed size.
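The point about backward passes can be made concrete: with fixed-width instructions, walking a trace in reverse is plain index arithmetic, with no decoding needed to find instruction boundaries. A small sketch, with made-up opcode numbers and a made-up struct (not the real IR):

```c
#include <stdint.h>

/* Hypothetical fixed-width micro-op; not CPython's actual IR. */
typedef struct {
    uint8_t opcode;
    uint32_t oparg;
} UOpInst;

/* Scan backwards from `pos` for the most recent store to local `n`.
 * With variable-width instructions this loop would need a side table
 * of instruction start offsets. */
static int
last_store_before(const UOpInst *trace, int pos,
                  uint8_t store_op, uint32_t n)
{
    for (int i = pos - 1; i >= 0; i--) {
        if (trace[i].opcode == store_op && trace[i].oparg == n) {
            return i;
        }
    }
    return -1;   /* no store to local n before pos */
}
```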

In terms of the interpreter, it is less important.
However, using the same format as the optimizer means we don't need an additional translation step, which should help with testing the optimizer and the compiler.

@gvanrossum
Collaborator

  • I always assumed that the tier 2 IR would have fixed-width instructions. But how important is it to have 32 bits instead of 64 bits? Seems a classic time/space tradeoff.
  • How do we fit an 8-byte pointer in 7 bytes? It seems CPU specific and possibly OS specific which bits of a pointer are always zero (apart from the lower three).
  • Could you elaborate on how the optimizer would be simpler? (Or link to a description of the optimizer design from which this can be derived.)

@gvanrossum
Collaborator

I'm not sure at what point we will be blocked on this, but based on my recent experiences modernizing parts of bytecodes.c, splitting all bytecodes into uops given some constraint on the size of the cache entries + opcode will be quite time-consuming, so we probably need to get a head start on this.

As a counterproposal to Mark's suggestion of 32-bit instructions with an opcode and either an oparg or one cache entry, let me suggest that we could start out with 64-bit instructions with an opcode and up to 56 bits of oparg and cache data. This would waste space for short instructions like LOAD_FAST, but would reduce (initially) the need to split too many instructions into tiny uops.

A hack we could use eventually, once we are ready for 32-bit instructions: we could still have the occasional variable-length instruction. E.g. we could have the top 8 bits as opcode and the lower 24 bits for cache and/or oparg, and when we need 48 bits, we could use two words, where the second word has an opcode that causes a debug exit when executed. You could then even read the instruction stream backwards without confusing cache data with instructions, since the top 8 bits will always be a dedicated bit pattern. This would be more like the current cache entries in bytecode (fixed per opcode) and not like the current EXTENDED_ARG opcode (which must be decoded dynamically), but would combine some benefits of both.
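The two-word scheme described here can be sketched as follows. This is a speculative illustration of the proposal, not anything implemented; the sentinel opcode value and helper names are made up, and a real implementation would reserve the sentinel so no ordinary micro-op uses it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel opcode for the second word of a two-word
 * instruction; executing it would trigger a debug exit. */
enum { UOP_EXTRA = 0xFF };

/* Split a 48-bit operand across two 32-bit words, opcode in the
 * top 8 bits of each word as proposed above. */
static void
encode48(uint32_t out[2], uint8_t opcode, uint64_t operand)
{
    assert(opcode != UOP_EXTRA);            /* sentinel is reserved */
    assert(operand < ((uint64_t)1 << 48));  /* must fit in 48 bits */
    out[0] = ((uint32_t)opcode << 24) | (uint32_t)(operand & 0xFFFFFF);
    out[1] = ((uint32_t)UOP_EXTRA << 24) | (uint32_t)(operand >> 24);
}

static uint64_t
decode48(const uint32_t in[2])
{
    assert((in[1] >> 24) == UOP_EXTRA);     /* second word is marked */
    return (uint64_t)(in[0] & 0xFFFFFF)
         | ((uint64_t)(in[1] & 0xFFFFFF) << 24);
}
```

Because every word's top byte is an opcode, a backward scan can always distinguish the operand word (tagged UOP_EXTRA) from a real instruction, which is the property the comment above relies on.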

@markshannon
Member Author

I wouldn't worry too much about the size.
Just make the instructions a struct.
We can start with

struct Inst {
    int32_t opcode;
    int64_t oparg;
};

And whittle it down to 64 bits, then to

struct Inst {
    int32_t opcode:8;
    int32_t oparg:22;
};

without having to change the interpreter.
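The whittling-down can be checked mechanically: since interpreter code only touches the `opcode` and `oparg` fields, swapping the wide layout for the bit-field layout is purely a size change. A sketch (exact sizes are ABI-dependent, so the comparison is hedged):

```c
#include <stdint.h>

/* The wide starting layout: easy to work with, 16 bytes on
 * typical 64-bit ABIs (4 + 4 padding + 8). */
struct InstWide {
    int32_t opcode;
    int64_t oparg;
};

/* The packed target layout: 30 bits of bit-fields in one
 * int32_t allocation unit, 4 bytes on common compilers. */
struct InstPacked {
    int32_t opcode:8;
    int32_t oparg:22;
};
```

Interpreter code written as `inst.opcode` / `inst.oparg` compiles unchanged against either struct, which is the point of the suggestion.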

@markshannon
Member Author

> How do we fit an 8-byte pointer in 7 bytes? It seems CPU specific and possibly OS specific which bits of a pointer are always zero (apart from the lower three).

We don't. We eliminate pointers from the instruction stream, but that's a separate issue.
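One common way to eliminate pointers from an instruction stream — offered here as an illustrative assumption, not what this issue decided on — is to intern each pointer into a per-code-object constant table and store only the small table index in the oparg. All names and the table size below are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONSTS 64   /* arbitrary cap for this sketch */

/* Hypothetical per-code-object table of interned pointers. */
typedef struct {
    const void *consts[MAX_CONSTS];
    size_t n_consts;
} ConstTable;

/* Return the index of ptr in the table, adding it if absent.
 * The index fits easily in a 22-bit oparg field. */
static uint32_t
intern_const(ConstTable *t, const void *ptr)
{
    for (size_t i = 0; i < t->n_consts; i++) {
        if (t->consts[i] == ptr) {
            return (uint32_t)i;   /* reuse the existing entry */
        }
    }
    assert(t->n_consts < MAX_CONSTS);
    t->consts[t->n_consts] = ptr;
    return (uint32_t)t->n_consts++;
}
```

At execution time the micro-op would load `t->consts[oparg]` instead of carrying the pointer inline, keeping every instruction word small and fixed-width.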
