
PLASMA JIT Compiler Implementation


The PLASMA Just-In-Time Compiler is implemented simply as a PLASMA module. Only a few small modifications were required to insert the JIT compiler module into the PLASMA VM and module loader. When the PLASMA module loader dynamically loads and links a module, it creates a small stub function in global memory containing a jump to the VM entry point as well as a pointer to the function's byte code, which may reside in auxiliary or extended memory. Different VM entry points expect the byte code to exist in different memory spaces.

The stub has been extended with a per-function call count and a function size value. The maximum size of a byte coded function that can be compiled on the fly is 255 bytes. This still maps to a fairly large routine, as PLASMA byte codes are quite dense. As implemented, the JIT compiler only operates on machines with 128K or more, where byte codes are interpreted out of auxiliary/extended memory. A small global memory buffer is pre-allocated for the resultant native machine code. When a bytecoded routine has been compiled into native machine code, its stub function is updated to point to the newly minted routine in the global code buffer. All machine coded routines, JIT compiled or otherwise, are required to exist in global memory.
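As a rough sketch, the extended stub can be pictured with the following layout. The offsets and names here are illustrative assumptions used by the examples below, not the actual VM source:

    // Hypothetical stub layout (offsets and names are assumptions)
    const STUB_JMP   = 0 // 3 bytes: JMP to a VM interpreter/JIT entry point
    const STUB_ADDR  = 3 // word: pointer to the function's byte code
    const STUB_COUNT = 5 // byte: per-function call count (JIT extension)
    const STUB_SIZE  = 6 // byte: byte code function size (JIT extension)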

As bytecoded routines that have been identified as potential compilation candidates are called, the per-function call count is decremented. Once it reaches zero, the JIT compiler is called with a pointer to the function's stub. The JIT compiler can divine everything it needs from the stub function.
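In PLASMA-style pseudocode, the countdown at the VM's JIT entry point might look like this. The names jitentry, jitcompile, and interp are illustrative, built on the hypothetical stub offsets above:

    // Hypothetical sketch of the per-call countdown (not the actual source)
    def jitentry(stub)
        stub->STUB_COUNT = stub->STUB_COUNT - 1 // decrement call count
        if stub->STUB_COUNT == 0
            jitcompile(stub) // compile; on success, patch the stub's JMP
        fin
        return interp(stub)  // this invocation runs interpreted (a simplification)
    end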

What is the JIT actually compiling?

When a PLASMA module is loaded, it is dynamically linked into the system, resolving all external references, loading other dependencies, and fixing up internal addresses with the load address. A PLASMA module has a data section and a byte coded function definition section. The data gets loaded into globally accessible memory; the byte code can be located in global or auxiliary/extended memory. The JIT compiler gets the byte code after all the relocations have been applied, so it only has to convert the byte code into machine code. The first order of business is to copy the byte code function into global memory to simplify the process. The byte codes could be fetched directly from auxiliary/extended memory, but doing so adds significant overhead. Memory for the byte codes and a buffer used for address translation are allocated at the beginning of the process. If there isn't enough free memory for these buffers, the JIT exits after patching the stub function with the standard interpreter entry point.
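A sketch of that setup, assuming hypothetical helpers heapalloc(), memxcpy(), and patchstub() for allocation, aux-to-main copying, and stub patching (none of these names are from the actual source):

    // Hypothetical entry to the JIT compiler proper
    def jitcompile(stub)
        byte funcsize
        word bytecode, xlatebuf

        funcsize = stub->STUB_SIZE
        bytecode = heapalloc(funcsize)     // global copy of the byte code
        xlatebuf = heapalloc(funcsize * 2) // one word per byte code offset
        if bytecode == 0 or xlatebuf == 0
            patchstub(stub, interpentry)   // fall back to the interpreter
            return
        fin
        memxcpy(bytecode, stub=>STUB_ADDR, funcsize) // copy from aux/ext memory
        // ... scan, compile, and patch the stub to the new native routine
    end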

The JIT compiler is an optimizing compiler. It attempts to match the native machine's register set to the PLASMA VM's stack-based architecture, and so tries to keep values in registers between PLASMA byte codes. A simple compiler would reload registers for each byte code; keeping values live, however, raises issues when flow of control changes the execution path. If a byte code is the destination of a branch, optimizations have to be handled carefully to avoid state corruption. The JIT compiler first scans the function, tagging all the branch destinations and setting flags for data bytes that shouldn't be treated as opcodes.

PLASMA byte codes are exclusively even-valued; this simplifies interpretation because an opcode can be used as an index into a jump table without any shifting. It also leaves the LSBit free to serve as the optimization fence flag set during the branch-destination tagging described above.
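A sketch of that tagging pass; isbranch(), branchdest(), and opsize() are stand-ins for the real opcode tests, not the actual source:

    // First pass: tag branch destinations as optimization fences.
    // Opcodes are even-valued, so bit 0 of the copied byte is free.
    // (The real scan also flags operand data bytes; elided here.)
    i = 0
    while i < funcsize
        opcode = ^(bytecode+i)
        if isbranch(opcode)
            dest = branchdest(bytecode, i)          // target byte code offset
            ^(bytecode+dest) = ^(bytecode+dest) | 1 // set LSBit fence flag
        fin
        i = i + opsize(opcode)                      // skip operand bytes
    loop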

Branch destinations in a byte code function are expressed as byte code addresses. When compiling, the native addresses will differ because each byte code expands to a varying number of machine instructions. An address translation buffer converts a byte code address to the corresponding machine code buffer address. Commonly, an address translation isn't known until later in the compilation process. In that case, instead of holding the translated address, the entry holds a linked list of native code locations awaiting resolution. When an opcode is about to be compiled, the translation buffer is checked for an existing linked list. If one is found, the compiler traverses the list, updating each element with the current native machine code address.
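A sketch of the resolution step, assuming (as is typical for backpatch chains) that each pending branch operand slot holds the link to the next pending slot; xlatebuf is the translation buffer allocated earlier:

    // Hypothetical: called when the opcode at byte code offset i is
    // about to be compiled and codeptr is the current native address
    def resolveaddr(i, codeptr)
        word fixup, next

        fixup = *(xlatebuf + i * 2)   // head of the pending fixup list
        while fixup
            next   = *fixup           // save the link before patching
            *fixup = codeptr          // fill in the waiting branch operand
            fixup  = next
        loop
        *(xlatebuf + i * 2) = codeptr // entry now holds the final address
    end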

The strategy used by the JIT compiler is three-fold. The PLASMA VM is a stack based machine, using a small evaluation stack located in the Zero Page and indexed by the X register. During execution of byte code operations, the X register is incremented and decremented to access values on the top of the evaluation stack. By tracking the pushes and pops to the stack, a Virtual X offset can be added to the stack address. The stack address is effectively adjusted at compile time instead of the X register at runtime, saving many instructions. The tracked Virtual X and the actual X register have to be resolved at certain times: optimization fences and flow-of-control changes.

when opcode
    ...
    is $30 // DROP
        VX++              // INX
        break
    ...

No code is emitted for DROP above.

    *codeptr = $D095+(VX<<8)         // STA ESTKL,X

Values written to and read from the evaluation stack simply add the Virtual X adjustment to the stack operand address.
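The read case is symmetric. For example, loading the TOS LSByte (LDA zp,X is opcode $B5, stored in the low byte of the emitted word, with the VX-adjusted operand in the high byte):

    *codeptr = $D0B5+(VX<<8)         // LDA ESTKL,X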

When the Virtual X register and the actual X register need to be resolved, this function is called:

codeptr, VX = resolveX(codeptr, VX)

resolveX() always returns 0 for VX.
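A plausible implementation simply emits enough INX or DEX instructions to bring the hardware X register in line with the tracked offset; this is a sketch, not the actual source:

    def resolveX(codeptr, VX)
        while VX > 0         // net pops pending: emit INX ($E8)
            ^codeptr = $E8
            codeptr++
            VX--
        loop
        while VX < 0         // net pushes pending: emit DEX ($CA)
            ^codeptr = $CA
            codeptr++
            VX++
        loop
        return codeptr, 0    // the virtual offset is now zero
    end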

PLASMA uses a software-defined frame stack for local variables. The IFP (Interpreter Frame Pointer) lives in Zero Page and is indexed by the Y register to access up to 256 bytes per call frame. When interpreting byte codes, the Y register does double duty as the Instruction Pointer LSByte and as the index off the IFP. In compiled code there is no need to use the Y register as the IP, so it spends most of its time indexing local variables. It also gets set to zero to fill in the MSByte of byte-sized memory reads. Interestingly, the most frequently referenced local variable is at index zero. The JIT compiler tracks the value of Y, trying to keep it at zero when possible. There are many times when the Y value can remain unchanged between byte codes, or be thrown away without having to set it to anything.
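Factored out of the LLB compilation shown later (where the check appears inline), the Virtual Y tracking might look like this sketch, with VY as the compiler's tracked value:

    // Emit LDY #imm only when the tracked Virtual Y differs from the
    // value needed; otherwise the Y register is already correct
    def setY(codeptr, val)
        if VY <> val
            *codeptr = $A0+(val<<8)   // LDY #imm
            codeptr  = codeptr + 2
            VY       = val
        fin
        return codeptr
    end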

By far the biggest optimization is caching the TOS (Top-Of-Stack) LSByte in the Accumulator. Stack machines read and write the stack on almost every instruction. PLASMA 2.0 has extended the byte code instruction set to include immediate and memory-based operands. This not only helps the efficiency of the byte code interpreter, it greatly improves the quality of the machine code generated by the JIT compiler: the JIT compiler doesn't have to do any analysis of byte code sequences to combine operations such as LOAD and ADD, because all that work was done AOT by the PLASMA compiler, PLASM. Although PLASMA defines a 16-bit architecture, the JIT compiler can be quite effective when dealing with 8-bit memory values. Some operations on the MSByte of the stack are still required, but with careful coding of the compiler, the Accumulator can often cache the LSByte between byte code operations.

Here is an example where all three optimizations are in play for the compilation of the Load Local Byte opcode:

when opcode
    ...
    is $64 // LLB - Load Local Byte
        i++
        j = ^(bytecode+i) // j = Interpreter Frame Pointer (IFP) offset to local variable
        if A_IS_TOSL & TOS_DIRTY
            *codeptr = $D095+(VX<<8)  // STA ESTKL,X
            codeptr  = codeptr + 2
        fin
        VX--                          // DEX
        if VY <> j
            *codeptr = $A0+(j<<8)     // LDY #imm
            codeptr  = codeptr + 2
            VY       = j
        fin
        *codeptr = $E0B1              // LDA (IFP),Y
        codeptr  = codeptr + 2
        if VY <> 0
            *codeptr = $00A0          // LDY #$00
            codeptr  = codeptr + 2
            VY       = 0
        fin
        *codeptr  = $C094+(VX<<8)     // STY ESTKH,X
        codeptr   = codeptr + 2
        A_IS_TOSL = TOS_DIRTY         // defer STA ESTKL,X; A caches TOSL
        break

The first thing you will see is the check for the cached TOS. If it is DIRTY, it must be written to the Evaluation Stack, since we are about to load a new value from the local frame stack. Then the offset of the local variable is checked against the tracked Virtual Y register. If they differ, we load the Y register with the offset value and update our tracked Virtual Y. Then we can load the local byte value. As it is just a byte value, the MSByte has to be set to zero. Here we could either use the Accumulator to write a zero to the MSByte of the Evaluation Stack, or use the Y register. By using the Y register, we set it to the useful value of zero. Note that had our local index been zero (a very common case), the Y register would already be zero and there would be no need to reload it.

Lastly, the loaded value is kept in the Accumulator without being written to the Evaluation Stack; it is quite probable that the value will be saved to another variable or used directly in a calculation. In the best case, only two instructions are needed to load a byte value from the first local variable: LDA (IFP),Y and STY ESTKH,X. This assumes Y is already zero and the Accumulator is NOT caching the TOS.

Here is a real-world example:

ESTKH = MSB OF EVALUATION STACK IN ZERO PAGE
ESTKL = LSB OF EVALUATION STACK IN ZERO PAGE
IFP   = INTERPRETER FRAME POINTER (FOR LOCAL VARIABLES)
B007-   A0 00       LDY   #$00
B009-   94 BF       STY   ESTKH-1,X
B00B-   A9 04       LDA   #$04
B00D-   A0 06       LDY   #$06
B00F-   91 E0       STA   (IFP),Y
B011-   A0 01       LDY   #$01
B013-   B1 E0       LDA   (IFP),Y
B015-   95 BF       STA   ESTKH-1,X
B017-   88          DEY
B018-   B1 E0       LDA   (IFP),Y
B01A-   A0 07       LDY   #$07
B01C-   91 E0       STA   (IFP),Y
B01E-   A0 03       LDY   #$03
B020-   B1 E0       LDA   (IFP),Y
B022-   95 BF       STA   ESTKH-1,X
B024-   88          DEY
B025-   B1 E0       LDA   (IFP),Y

This is an actual disassembly from the Apple II code buffer, with the Zero Page addresses replaced by their symbolic names. This is some initialization code filling in local variables. A little further on:

B049-   91 E0       STA   (IFP),Y
B04B-   C8          INY
B04C-   B5 BF       LDA   ESTKH-1,X
B04E-   91 E0       STA   (IFP),Y
B050-   A0 00       LDY   #$00
B052-   94 BF       STY   ESTKH-1,X
B054-   A9 CA       LDA   #$CA
B056-   95 CF       STA   ESTKL-1,X
B058-   A9 06       LDA   #$06
B05A-   18          CLC
B05B-   65 E0       ADC   IFP
B05D-   95 CE       STA   ESTKL-2,X
B05F-   98          TYA
B060-   65 E1       ADC   IFP+1
B062-   95 BE       STA   ESTKH-2,X
B064-   CA          DEX
B065-   CA          DEX
B066-   20 64 11    JSR   $1164
B069-   B5 D0       LDA   ESTKL,X
B06B-   8D 17 21    STA   $2117
B06E-   A0 0D       LDY   #$0D

Here it is preparing to make a function call with the address of a local variable (the offset is added to the value of the Interpreter Frame Pointer and passed as a parameter). Note that the X register is finally resolved with the Virtual X offset. Had the Virtual X offset not been tracked, there would have been many more INX and DEX instructions sprinkled throughout the code. Also, for the calculation of the local variable's address, the JIT compiler recognized that the Y register contained zero from a previous operation and emitted smaller code by using TYA instead of LDA #$00.

Here are the results from the first test of the JIT compiler, using the AOT PLASMA compiler, PLASM, as the test case. It is the largest and most sophisticated PLASMA program available, and it is easy to measure the time it takes to complete a task. The first version of the JIT compiler had three run-time tunable parameters and one that required a rebuild. The three tunable parameters were the Initialization Warmup Call Count, the Per-Function Call Count, and the Maximum Function Size. The Warmup Count simply counted down all the calls through the VM's JIT entry point until it reached zero. Then the Per-Function Call Count was counted down until it reached zero, at which point the VM would call the JIT compiler for that function. Only functions smaller than or equal to the Maximum Function Size were ever passed to the VM's JIT entry point. The parameter requiring a rebuild was the size of the code buffer holding the compiled machine code. If it isn't big enough, some candidate byte code functions can't be compiled, limiting potential performance.

The following tests were made by timing the entire compilation process, including loading and dynamically linking the PLASM compiler itself. This load time and I/O overhead were the same regardless of test and probably accounted for about 7 seconds of each measurement.

Test: +PLASM SIEVE.PLA

Baseline Interpreted PLASMA VM: 18.81 seconds

Elapsed time in seconds for Maximum Size = 255 and Code Buffer = 6KBytes:

| Warmup Count \ Call Count | 1 | 8 | 16 | 32 | 64 | 96 |
|---|---|---|---|---|---|---|
| 0 | 24.48 | 20.70 | 19.45 | 17.14 | 16.91 | 17.78 |
| 32 | 24.23 | 20.30 | 19.43 | 17.17 | 16.91 | 17.83 |
| 64 | 24.51 | 20.70 | 19.56 | 17.35 | 17.28 | |
| 128 | 24.54 | 20.54 | 19.77 | 17.50 | 17.19 | |
| 1024 | 25.51 | 21.23 | 20.12 | 19.36 | | |

With the first round of testing, one thing became very apparent: the Warmup Count didn't contribute anything worthwhile. For the rest of the tests, Warmup Count was set to zero. Then, the Maximum Size was adjusted to see how that affected times:

Elapsed time in seconds for Warmup Count = 0 and Code Buffer = 6KBytes:

| Max Size \ Call Count | 1 | 32 | 64 | 96 |
|---|---|---|---|---|
| 128 | 22.80 | 17.10 | 16.97 | 17.76 |
| 64 | 19.70 | 19.28 | 19.25 | |

It looks like a Max Size of 128 performs about the same as 255, so in search of a minimum, I iterated through a range of Call Counts from 40 to 56:

Elapsed time in seconds for Warmup Count = 0, Max Size = 128, and Code Buffer = 6KBytes:

| Call Count | 40 | 44 | 45 | 46 | 48 | 50 | 52 | 56 |
|---|---|---|---|---|---|---|---|---|
| Elapsed Time | 16.56 | 16.48 | 16.44 | 16.51 | 16.56 | 16.65 | 16.73 | 16.57 |

I rebuilt the VM with a 4K code buffer and re-ran many of the test cases. The result was the same across the board, so the 4K code buffer was sufficient for JIT compiling the PLASM compiler test case.

The last test was to compile a much more substantial PLASMA source code program with PLASM and measure the overall time. The RPNCALC.PLA program was used to time the baseline, 6K JIT and 4K JIT VMs.

Test: +PLASM RPNCALC.PLA

Baseline Interpreted PLASMA VM: 217.48 seconds

PLASMA 6K Code Buffer JIT VM: 117.92 seconds

PLASMA 4K Code Buffer JIT VM: 119.02 seconds

With these results, I made one change to the JIT VM: I removed the Warmup Count altogether. It cleaned up the VM's JIT entry point nicely. A 4K code buffer is pre-allocated and the default Call Count is 36. I put the Maximum Size at 96.

Although the 6502 is a very limited resource CPU, the PLASMA JIT compiler puts the available registers to good use throughout the execution of the generated machine code.
