Vm size optimizations: Get back 1500 bytes for 3.2% VM speed decrease #4344

Merged: 3 commits, Mar 7, 2021

jepler
Member

@jepler jepler commented Mar 5, 2021

I found two size optimizations in mp_execute_bytecode, the main function implementing the "virtual machine". They cause a modest speed decrease, so they are only turned on for samd21 builds.

There were two optimizations:

  • First (thanks to @dhalbert for a related idea about "pointer compression"), the entries of a table called entry_table were reduced from a 4-byte type to a 2-byte type. However, a small amount of arithmetic had to be added to each DISPATCH(), so the size savings were about half of what I'd hoped for.
  • Second, the code in the DISPATCH macro is itself a significant part of the overall code size of mp_execute_bytecode, so all of the copies were consolidated into a single ONE_TRUE_DISPATCH(), which the others reach by goto. This saves much more space, at the expense of one additional jump for every bytecode executed.
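The table-compression idea in the first bullet can be modelled in a few lines of Python. This is only a sketch of the arithmetic, not the C code from the pull request: `BASE`, the handler addresses, and the helper `dispatch_target` are made-up names and values for illustration. It stores each handler as a 2-byte offset from a common base instead of a 4-byte absolute address, and shows the extra addition needed at dispatch time.

```python
import struct

# Hypothetical values: a base address and four handler addresses near it.
BASE = 0x00024000
handlers = [BASE + off for off in (0x0010, 0x0154, 0x0A3C, 0x1F20)]

# Uncompressed table: one 4-byte absolute address per opcode.
full_table = struct.pack(f"<{len(handlers)}I", *handlers)

# Compressed table: one 2-byte offset from BASE per opcode.
offsets = [addr - BASE for addr in handlers]
small_table = struct.pack(f"<{len(offsets)}H", *offsets)


def dispatch_target(opcode):
    """Recover the absolute address; this addition is the per-dispatch cost."""
    (offset,) = struct.unpack_from("<H", small_table, 2 * opcode)
    return BASE + offset


print(len(full_table), len(small_table))  # table sizes in bytes
```

The table itself halves in size, but every dispatch site now carries the "load offset, add base" step, which is consistent with the net saving being smaller than the raw table shrinkage.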

Sizes and timings are from a Feather M0 Adalogger with the English-language build.

Version             Free Flash   Increase      Benchmark Time   Increase
Original            2100 bytes                 .655s
Compress Table      2332 bytes   +232          .666s            +1.7%
One True Dispatch   3600 bytes   +1268/+1500   .676s            +1.7%/+3.2%
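As a quick cross-check of the figures above, the following sketch recomputes each increase both against the previous row and against the original build. The numbers are copied from the table; the helper `pct` is a name introduced here for illustration. Note that with the table's .666s figure the last step works out to roughly 1.5%, while the commit message quotes .665 -> .676s, which gives 1.7%; the sketch does not try to decide which intermediate timing is the right one.

```python
# Figures copied from the table: free flash in bytes, benchmark time in ms.
rows = [
    ("Original", 2100, 655),
    ("Compress Table", 2332, 666),
    ("One True Dispatch", 3600, 676),
]


def pct(new, old):
    """Percentage change from old to new, rounded to one decimal place."""
    return round(100 * (new - old) / old, 1)


_, base_flash, base_ms = rows[0]
prev_flash, prev_ms = base_flash, base_ms
for name, flash, ms in rows[1:]:
    print(
        f"{name}: flash +{flash - prev_flash} vs previous, "
        f"+{flash - base_flash} vs original; "
        f"time +{pct(ms, prev_ms)}% vs previous, "
        f"+{pct(ms, base_ms)}% vs original"
    )
    prev_flash, prev_ms = flash, ms
```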

Simple timing program:

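import time

# Time a 10,000-iteration integer accumulation loop.
t0 = time.monotonic()
s = 0
for i in range(10000):
    s += i
t1 = time.monotonic()
# Print the elapsed time in seconds and the sum (as a sanity check).
print(t1 - t0, s)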

This also adds a bit of code everywhere we DISPATCH(), but the net is
+232 bytes free on Feather M0 Adalogger.

Key assumption: all of the offsets in mp_execute_bytecode fit in 16 bits.
It is not clear whether the compiler will verify this assumption (e.g.,
by warning that a constant will be truncated).
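To illustrate what is at stake in that assumption, here is a small Python sketch (not part of the pull request; `pack_offsets` and `truncate16` are hypothetical helpers). It contrasts an explicit range check with what silent truncation to 16 bits would store instead. In the real C build any such check would have to happen at compile or link time.

```python
import struct


def pack_offsets(offsets):
    """Pack offsets as unsigned 16-bit values, failing loudly on overflow."""
    for i, off in enumerate(offsets):
        if not 0 <= off <= 0xFFFF:
            raise ValueError(f"entry {i}: offset {off:#x} does not fit in 16 bits")
    return struct.pack(f"<{len(offsets)}H", *offsets)


def truncate16(off):
    """The value that would be stored if the offset were silently truncated."""
    return off & 0xFFFF


table = pack_offsets([0x0000, 0x1234, 0xFFFF])  # all in range, packs fine
wrong = truncate16(0x10004)  # an out-of-range offset would wrap to 0x0004
```

A wrapped offset would send the dispatch to an unrelated handler, which is why it matters whether the toolchain diagnoses the truncation.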
Flash savings: 1268 bytes
Performance: 10,000 iteration loop .665 -> .676s (+1.7%)
@dhalbert
Collaborator

dhalbert commented Mar 5, 2021

This is great!!!

.. and enable for all samd21 boards
@jepler jepler marked this pull request as ready for review March 6, 2021 02:12
@jepler
Member Author

jepler commented Mar 6, 2021

CI failures seemed to be network-related.

@ladyada
Member

ladyada commented Mar 6, 2021

i kicked it :)

Collaborator

@dhalbert dhalbert left a comment

Thanks very much for this! In the long run I have been thinking about labeling builds as "192kB", "256kB", etc., rather than "FULL_BUILD", etc. to give a little more information. But the SAMD21 distinction works well for the current mix of boards.

@tannewt
Member

tannewt commented Mar 8, 2021

Nice work! cc @dpgeorge

@dpgeorge

dpgeorge commented Mar 9, 2021

Interesting!

Did you try compiling with MICROPY_OPT_COMPUTED_GOTO disabled, i.e., using a big switch in the VM? I just tried this patch out here and it seems that a big switch is still smaller than what is here.

Using minimal port, cross compiled to Cortex-M4 with -Os:

build options                                       fw size     diff to baseline
baseline MICROPY_OPT_COMPUTED_GOTO=0                67224       +0  
MICROPY_OPT_COMPUTED_GOTO=1                         69008       +1784   
MICROPY_OPT_COMPUTED_GOTO=1 w/ compressed table:    68516       +1292   
MICROPY_OPT_COMPUTED_GOTO=1 w/ one-true-dispatch:   67844       +620    
MICROPY_OPT_COMPUTED_GOTO=1 w/ both above:          67336       +112
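The same table can also be read relative to the plain computed-goto build, which shows how much each optimization recovers. A short sketch using only the sizes quoted above (the dictionary keys are shortened labels chosen here):

```python
# Firmware sizes in bytes, copied from the table above.
sizes = {
    "switch (COMPUTED_GOTO=0)": 67224,
    "computed goto": 69008,
    "computed goto + compressed table": 68516,
    "computed goto + one-true-dispatch": 67844,
    "computed goto + both": 67336,
}

baseline = sizes["switch (COMPUTED_GOTO=0)"]
goto = sizes["computed goto"]

for name, size in sizes.items():
    print(
        f"{name:36s} {size - baseline:+6d} vs switch "
        f"{size - goto:+6d} vs computed goto"
    )
```

Read this way, the two optimizations together recover all but 112 bytes of the 1784-byte cost of enabling computed goto in this particular build.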

@dhalbert
Collaborator

dhalbert commented Mar 9, 2021

We have not disabled MICROPY_OPT_COMPUTED_GOTO because turning it on produced a 5x (!) gain in speed: #1934, #1933

@dpgeorge

dpgeorge commented Mar 9, 2021

> because turning it on produced a 5x (!) gain in speed

That doesn't seem right... using computed goto or not shouldn't affect how often the VM hook macros are executed, and shouldn't lead to such a huge difference in speed. I just tested this by running our benchmark suite on a PYBLITEv1.0 (STM32F411) and turning off MICROPY_OPT_COMPUTED_GOTO leads to about a 1-2% decrease in performance.

@dhalbert
Collaborator

We did not run a benchmark suite but instead a simple loop test: #1933 (comment)

@jepler jepler deleted the vm-size-optimizations branch November 3, 2021 21:11
5 participants