Vm size optimizations: Get back 1500 bytes for 3.2% VM speed decrease #4344

jepler · 2021-03-05T23:08:22Z

I found two size optimizations in mp_execute_bytecode, the main function implementing the "virtual machine". Together they cause a modest speed decrease, so they are only turned on for samd21 builds.

There were two optimizations:

First (thanks to @dhalbert for a related idea about "pointer compression"), the entries of a table called entry_table were reduced from a 4-byte type to a 2-byte type. However, a small amount of arithmetic was added to each DISPATCH(), making the size savings about half of what I'd hoped for.
Second, realizing that the code in the DISPATCH macro itself is a significant part of the overall code size of mp_execute_bytecode, I consolidated all of the copies into a single ONE_TRUE_DISPATCH(), which the others reach by goto. This saves much more space, at the expense of one additional jump for every bytecode executed.
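A minimal sketch of how the two changes fit together, not the actual mp_execute_bytecode code. The three opcodes and the names toy_execute and compress_entry are hypothetical, the int16_t entry type is an assumption, and the table is filled at run time only to keep the example short. It relies on the GNU C "labels as values" extension (GCC and Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-opcode bytecode, for illustration only. */
enum { OP_INC, OP_DOUBLE, OP_HALT, OP_COUNT };

/* Store a handler's distance from the base label in 2 bytes. */
static int16_t compress_entry(char *handler, char *base) {
    ptrdiff_t offset = handler - base;
    /* The key assumption: every offset fits in 16 bits. */
    assert(offset >= INT16_MIN && offset <= INT16_MAX);
    return (int16_t)offset;
}

/* Run `code` until OP_HALT and return the accumulator.
   Opcodes outside the enum are not handled in this sketch. */
int toy_execute(const uint8_t *code) {
    const uint8_t *ip = code;
    int acc = 0;

    /* Filled at run time so the sketch stays simple; a firmware build
       would want this table constant so that it lives in flash. */
    char *base = (char *)&&entry_inc;
    int16_t entry_table[OP_COUNT];
    entry_table[OP_INC] = compress_entry((char *)&&entry_inc, base);
    entry_table[OP_DOUBLE] = compress_entry((char *)&&entry_double, base);
    entry_table[OP_HALT] = compress_entry((char *)&&entry_halt, base);

    /* Every handler ends with a plain jump to the one shared dispatch. */
#define DISPATCH() goto one_true_dispatch

one_true_dispatch:
    /* The only copy of the decode-and-jump sequence:
       base address plus a 2-byte offset, then an indirect goto. */
    goto *(void *)(base + entry_table[*ip++]);

entry_inc:
    acc += 1;
    DISPATCH();
entry_double:
    acc *= 2;
    DISPATCH();
entry_halt:
    return acc;
}
```

The "base plus offset" addition is the extra arithmetic that eats into the table savings, and the goto one_true_dispatch is the extra jump paid per bytecode. If the real table has one entry per possible opcode byte (256 entries, an assumption here), halving the entry size would save 512 bytes before that added arithmetic, which would be consistent with the measured +232 being "about half".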

Sizes and timings from a Feather M0 Adalogger and the English language build.

Version             Free flash    Gain (step/total)    Benchmark time    Slowdown (step/total)
Original            2100 bytes                         .655s
Compress Table      2332 bytes    +232                 .666s             +1.7%
One True Dispatch   3600 bytes    +1268/+1500          .676s             +1.7%/+3.2%

Simple timing program:

import time

t0 = time.monotonic()
s = 0
for i in range(10000):  # 10,000 iterations of a trivial loop body
    s += i
t1 = time.monotonic()
print(t1 - t0, s)  # elapsed seconds, and the sum as a sanity check

Compressing the table also adds a bit of code everywhere we DISPATCH(), but the net result is 232 more bytes free on Feather M0 Adalogger. Key assumption: all of the offsets in mp_execute_bytecode fit in 16 bits. It is not clear whether the compiler will verify this assumption (e.g., by warning that a constant will be truncated).
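To make the key assumption concrete, here is a small sketch of what a 16-bit entry ends up holding when an offset is too large, together with a range check that would catch it in a debug build. The function names are hypothetical, and unsigned 16-bit entries are an assumption chosen because conversion to an unsigned type is well-defined in C; whether a particular toolchain diagnoses the real table is left open.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if `offset` can be stored in a 16-bit unsigned table entry. */
int offset_fits_u16(long offset) {
    return offset >= 0 && offset <= UINT16_MAX;
}

/* What a 16-bit entry holds if nothing checks the range:
   conversion to an unsigned type wraps modulo 65536. */
uint16_t unchecked_entry(long offset) {
    return (uint16_t)offset;
}

/* Same store, but a debug build aborts on an out-of-range offset. */
uint16_t checked_entry(long offset) {
    assert(offset_fits_u16(offset));
    return (uint16_t)offset;
}
```

An offset of 70000 would be stored as 70000 - 65536 = 4464, so a dispatch through that entry would land on the wrong code without any error at the point of the store.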

Flash savings: 1268 bytes. Performance: 10,000 iteration loop .665 -> .676s (+1.7%)

dhalbert · 2021-03-05T23:10:07Z

This is great!!!

.. and enable for all samd21 boards

jepler · 2021-03-06T02:13:04Z

CI failures seemed to be network-related.

ladyada · 2021-03-06T20:43:37Z

i kicked it :)

dhalbert

Thanks very much for this! In the long run I have been thinking about labeling builds as "192kB", "256kB", etc., rather than "FULL_BUILD", etc. to give a little more information. But the SAMD21 distinction works well for the current mix of boards.

tannewt · 2021-03-08T23:24:52Z

Nice work! cc @dpgeorge

dpgeorge · 2021-03-09T11:43:13Z

Interesting!

Did you try compiling with MICROPY_OPT_COMPUTED_GOTO disabled, i.e. using a big switch in the VM? I just tried this patch out here and it seems that a big switch is still smaller than what is here.

Using minimal port, cross compiled to Cortex-M4 with -Os:

build options                                       fw size     diff to baseline
baseline MICROPY_OPT_COMPUTED_GOTO=0                67224       +0  
MICROPY_OPT_COMPUTED_GOTO=1                         69008       +1784   
MICROPY_OPT_COMPUTED_GOTO=1 w/ compressed table:    68516       +1292   
MICROPY_OPT_COMPUTED_GOTO=1 w/ one-true-dispatch:   67844       +620    
MICROPY_OPT_COMPUTED_GOTO=1 w/ both above:          67336       +112
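For comparison, a rough sketch of the "big switch" shape that the baseline row refers to. The opcodes and the name toy_execute_switch are hypothetical and this is not the real VM loop; it only shows that a switch-based interpreter has a single decode point by construction, with the compiler choosing how to implement the branch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical three-opcode bytecode, for illustration only. */
enum { OP_INC, OP_DOUBLE, OP_HALT };

/* Run `code` until OP_HALT and return the accumulator,
   or -1 if an unknown opcode is encountered. */
int toy_execute_switch(const uint8_t *code) {
    assert(code != NULL);
    const uint8_t *ip = code;
    int acc = 0;
    for (;;) {
        /* The single decode point: every handler comes back here. */
        switch (*ip++) {
            case OP_INC:
                acc += 1;
                break;
            case OP_DOUBLE:
                acc *= 2;
                break;
            case OP_HALT:
                return acc;
            default:
                return -1; /* unknown opcode in this toy VM */
        }
    }
}
```

Because no table of label addresses is written by hand here, there is no per-handler copy of the dispatch sequence to consolidate; the size and speed actually obtained depend on the compiler and target.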

dhalbert · 2021-03-09T13:14:32Z

We have not disabled MICROPY_OPT_COMPUTED_GOTO because turning it on produced a 5x (!) gain in speed: #1934, #1933

dpgeorge · 2021-03-09T14:00:29Z

> because turning it on produced a 5x (!) gain in speed

That doesn't seem right... using computed goto or not shouldn't affect how often the VM hook macros are executed, and shouldn't lead to such a huge difference in speed. I just tested this by running our benchmark suite on a PYBLITEv1.0 (STM32F411) and turning off MICROPY_OPT_COMPUTED_GOTO leads to about a 1-2% decrease in performance.

dhalbert · 2021-03-10T14:51:19Z

We did not run a benchmark suite but instead a simple loop test: #1933 (comment)

jepler added 2 commits March 5, 2021 16:52

vm: Consolodate all dispatch instructions

7b359d7

Flash savings: 1268 bytes. Performance: 10,000 iteration loop .665 -> .676s (+1.7%)

vm: Make the speed-size trade-off compile time settable

4f040af

.. and enable for all samd21 boards

jepler marked this pull request as ready for review March 6, 2021 02:12

dhalbert approved these changes Mar 6, 2021

jepler mentioned this pull request Mar 7, 2021

Interesting size savings in mp_execute_bytecode: -1500 bytes, -3.2% speed micropython/micropython#7004

Closed

jepler merged commit a4133c4 into adafruit:main Mar 7, 2021

dhalbert mentioned this pull request Mar 8, 2021

6.2.0 beta 4 release notes #4361

Closed

jepler deleted the vm-size-optimizations branch November 3, 2021 21:11
