Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changes to allow different compiler optimizations per board #3190

Merged
merged 16 commits into from
Jul 25, 2020
Merged

Changes to allow different compiler optimizations per board #3190

merged 16 commits into from
Jul 25, 2020

Conversation

DavePutz
Copy link
Collaborator

@DavePutz DavePutz commented Jul 22, 2020

While working on some performance issues, it was found that using the -O2 option instead of -Os on boards with sufficient flash memory can greatly help performance.
Connie & I did some testing on the boards we have available. The tests we ran were:

  1. The long-running bigint calculation script from issue ItsyBitsy M4 Express w/ CP5.3.0: bignum(?) code prevents USB from functioning (mount and REPL access) #2949
  2. The script from https://pastebin.com/BAUS82X9
  3. The ulab simple benchmark script from https://learn.adafruit.com/ulab-crunch-numbers-fast-with-circuitpython/a-simple-
    benchmark
  4. For boards with a display:
    sprite.py (attached in zip file)
    dispio_test.py (attached in zip file

Here is a table of our results showing the improvement pct. using -O2

- BOARD                                             TEST1             TEST2                TEST3                 TEST4                        
- itsybitsy_m4_express                              42%                26%                     17%               Not run
- pyportal                                          20%                21%                     10%                  7%
- pybadge                                           44%                13%                     14%                 13%
- esp32s2 wroom                                     31%                53%                   No ulab            Not run
- circuitplayground_bluefruit                       42%                15%                      15%              Not run
- feather_stm32f405_express                         52%                18%                      16%              Not run

The option has been implemented by adding a line:
OPTIMIZATION_LEVEL =
to the boards/$BOARD/mpconfigboard.mk file
and putting code into the various Makefiles to add that value to CFLAGS if present. This allows boards
for which no change is desired to build exactly as before. The modification is taking advantage
of the fact that if there are multiple optimization (-On) parameters given to gcc the last one wins.

test_scripts.zip
)

@deshipu
Copy link

deshipu commented Jul 22, 2020

This is probably making it too complicated, but I wonder if it would even make sense to compile different objects with different optimization levels, so even if there isn't enough room to optimize everything, you can have the most vital parts optimized...

ifdef OPTIMIZATION_LEVEL
CFLAGS += -O$(OPTIMIZATION_LEVEL)
endif

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't the CFLAGS from lines 90, 97 and 104 interfere with this?

@dhalbert
Copy link
Collaborator

Note that in py/py.mk, gc.o and vm.o will be compiled -O3 if SUPEROPT_GC or SUPEROPT_VM are set to 1 (which is the default). So could you make sure that this change is not overriding that higher optimization level? Thanks.

@@ -15,3 +15,4 @@ LONGINT_IMPL = MPZ
CIRCUITPY_AUDIOBUSIO = 0

CIRCUITPY_BITBANG_APA102 = 1
OPTIMIZATION_LEVEL = 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding this flag for individual boards, I think it would make sense to do -O2 by default for all SAMD51 and nRF52 Express boards. There may be space problems on non-Express boards.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dhalbert See comment below. I think we will need to implement this on a board-by-board basis.

Copy link
Collaborator Author

@DavePutz DavePutz Jul 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can always use INTERNAL_FLASH_FILESYSTEM = 1 as a guide. There is at least one board (sam32) that has that set but has 1M of flash (i.e. plenty for the 10% size increase). And,as future chips may well fall into that same category. I would consider it safer to have a separate flag. Also, the non_SAMD chips need to be considered as well.

@dhalbert
Copy link
Collaborator

Thank you for doing all this testing! We have not pursued this, and it's really nice someone is.

@DavePutz
Copy link
Collaborator Author

DavePutz commented Jul 23, 2020 via email

@DavePutz
Copy link
Collaborator Author

DavePutz commented Jul 23, 2020 via email

@DavePutz
Copy link
Collaborator Author

DavePutz commented Jul 23, 2020 via email

@dhalbert
Copy link
Collaborator

I tried compiling with -O2 for all SAMD51 and SAMD54 boards, but there were three boards (loc_ber_m4_base_board, pewpew_m4, and one other I don't remember) that did not have sufficient flash.

These boards all have INTERNAL_FLASH_FILESYSTEM = 1 in mpconfigboard.mk, so that could be the trigger for not adding -O2 to a SAMD51 board. Nearly all the other boards either

I have tried in the past to minimize the number of board-specific settings in mpconfigboard.mk if they can be derived from other settings.

@dhalbert
Copy link
Collaborator

@dhalbert - Because gcc only pays attention to the last -O option, the -O3 for gc.o and vm.o will still be in effect. From the (very long) compile line: ... -DCFG_TUD_MIDI_TX_BUFSIZE=128 -DCFG_TUD_CDC_TX_BUFSIZE=256 -DCFG_TUD_MSC_BUFSIZE=1024 -O2 -flto -flto-partition=none ... -O3 -c -MD -o build-pyportal/py/vm.o ../../py/vm.c

Thanks, I was not sure the new setting would be the last one, since it depends on the order of applying CFLAGS +=.

The -O2 could also be set by parameterizing the -Os that's currently in ports/atmel-samd/Makefile. Then it's not order-dependent.

Also note in py/py/mk, line 351, nlr%.o is always -Os, and should not be changed. I checked and the order is OK there too:

arm-none-eabi-gcc -DCIRCUITPY_FULL_BUILD=1 -DCIRCUITPY_AESIO=0 -DCIRCUITPY_ANALOGIO=1 -DCIRCUITPY_AUDIOBUSIO=1 -DCIRCUITPY_AUDIOIO=1 -DCIRCUITPY_AUDIOPWMIO=0 -DCIRCUITPY_AUDIOCORE=1 -DCIRCUITPY_AUDIOMIXER=1 -DCIRCUITPY_AUDIOMP3=1 -DCIRCUITPY_BITBANGIO=1 -DCIRCUITPY_BLEIO=0 -DCIRCUITPY_BOARD=1 -DCIRCUITPY_BUSIO=1 -DCIRCUITPY_DIGITALIO=1 -DCIRCUITPY_COUNTIO=1 -DCIRCUITPY_DISPLAYIO=1 -DCIRCUITPY_FRAMEBUFFERIO=1 -DCIRCUITPY_VECTORIO=1 -DCIRCUITPY_FREQUENCYIO=1 -DCIRCUITPY_GAMEPAD=1 -DCIRCUITPY_GAMEPADSHIFT=0 -DCIRCUITPY_GNSS=0 -DCIRCUITPY_I2CPERIPHERAL=1 -DCIRCUITPY_MATH=1 -DCIRCUITPY__EVE=1 -DCIRCUITPY_MICROCONTROLLER=1 -DCIRCUITPY_NEOPIXEL_WRITE=1 -DCIRCUITPY_NETWORK=0 -DCIRCUITPY_NVM=1 -DCIRCUITPY_OS=1 -DCIRCUITPY_PIXELBUF=1 -DCIRCUITPY_RGBMATRIX=1 -DCIRCUITPY_PULSEIO=1 -DCIRCUITPY_PS2IO=1 -DCIRCUITPY_RANDOM=1 -DCIRCUITPY_ROTARYIO=1 -DCIRCUITPY_RTC=1 -DCIRCUITPY_SAMD=1 -DCIRCUITPY_SDCARDIO=1 -DCIRCUITPY_SDIOIO=0 -DCIRCUITPY_STAGE=0 -DCIRCUITPY_STORAGE=1 -DCIRCUITPY_STRUCT=1 -DCIRCUITPY_SUPERVISOR=1 -DCIRCUITPY_TIME=1 -DCIRCUITPY_TOUCHIO_USE_NATIVE=0 -DCIRCUITPY_TOUCHIO=1 -DCIRCUITPY_UHEAP=0 -DCIRCUITPY_USB_HID=1 -DCIRCUITPY_USB_MIDI=1 -DCIRCUITPY_PEW=0 -DCIRCUITPY_USTACK=0 -DCIRCUITPY_BITBANG_APA102=0 -DCIRCUITPY_REQUIRE_I2C_PULLUPS=1 -DCIRCUITPY_SERIAL_BLE=0 -DCIRCUITPY_BLE_FILE_SERVICE=0 -DCIRCUITPY_SERIAL_UART=0 -DCIRCUITPY_ULAB=1 -DCIRCUITPY_WATCHDOG=0 -DCIRCUITPY_ENABLE_MPY_NATIVE=0 -DINTERNAL_FLASH_FILESYSTEM=0 -DQSPI_FLASH_FILESYSTEM=1 -DSPI_FLASH_FILESYSTEM=0 -DEXTERNAL_FLASH_DEVICES="S25FL116K, S25FL216K, GD25Q16C" -DEXTERNAL_FLASH_DEVICE_COUNT=3 -DUSB_AVAILABLE -DLONGINT_IMPL_MPZ -Os -DNDEBUG -DCFG_TUSB_MCU=OPT_MCU_SAMD51 -DCFG_TUD_MIDI_RX_BUFSIZE=128 -DCFG_TUD_CDC_RX_BUFSIZE=256 -DCFG_TUD_MIDI_TX_BUFSIZE=128 -DCFG_TUD_CDC_TX_BUFSIZE=256 -DCFG_TUD_MSC_BUFSIZE=1024 -flto -flto-partition=none -I. -I../.. -I../lib/mp-readline -I../lib/timeutils -Iasf4/samd51 -Iasf4/samd51/hal/include -Iasf4/samd51/hal/utils/include -Iasf4/samd51/hri -Iasf4/samd51/hpl/core -Iasf4/samd51/hpl/gclk -Iasf4/samd51/hpl/pm -Iasf4/samd51/hpl/port -Iasf4/samd51/hpl/rtc -Iasf4/samd51/hpl/tc -Iasf4/samd51/include -Iasf4/samd51/CMSIS/Include -Iasf4_conf/samd51 -Iboards/metro_m4_express -Iboards/ -Iperipherals/ -Ifreetouch -I../../lib/tinyusb/src -I../../supervisor/shared/usb -Ibuild-metro_m4_express -Wall -Werror -std=gnu11 -nostdlib -fshort-enums -fsingle-precision-constant -fno-strict-aliasing -Wdouble-promotion -Wno-endif-labels -Wstrict-prototypes -Werror-implicit-function-declaration -Wfloat-equal -Wundef -Wshadow -Wwrite-strings -Wsign-compare -Wmissing-format-attribute -Wno-deprecated-declarations -Wnested-externs -Wunreachable-code -Wcast-align -Wno-error=lto-type-mismatch -D__SAMD51J19A__ -ffunction-sections -fdata-sections -DCIRCUITPY_SOFTWARE_SAFE_MODE=0x0ADABEEF -DCIRCUITPY_CANARY_WORD=0xADAF00 -DCIRCUITPY_SAFE_RESTART_WORD=0xDEADBEEF --param max-inline-insns-single=500 -DFFCONF_H=\"lib/oofatfs/ffconf.h\" -DCIRCUITPY_ULAB=1 -DMODULE_ULAB_ENABLED=1  -mthumb -mabi=aapcs-linux -mcpu=cortex-m4 -mfloat-abi=hard -mfpu=fpv4-sp-d16 -DSAM_D5X_E5X -DSAMD51 -Os -c -MD -o build-metro_m4_express/py/nlrsetjmp.o ../../py/nlrsetjmp.c

@DavePutz DavePutz marked this pull request as ready for review July 23, 2020 03:35
@dhalbert
Copy link
Collaborator

dhalbert commented Jul 23, 2020

@dhalbert See comment below. I think we will need to implement this on a board-by-board basis.

That is what I am trying to avoid. I would like the defaults for a particular chip or chip family to be the most common case, and then the boards that don't work can be overriden in mpconfigboard.mk. I would like most mpconfigboard.mk files not to have to change. For instance, the current PR does not change a large number of SAMD51 boards that could benefit from -O2.

I did a big refactoring of the board files several months ago and was able to reduce their sizes significantly by choosing reasonable defaults for various options, instead of including a large amount of boilerplate in each file. I am trying to avoid creeping back to that. It makes it easier to add a new board when the number of required options is small.

I tried compiling with -O2 for all SAMD51 and SAMD54 boards, but there were three boards (loc_ber_m4_base_board, pewpew_m4, and one other I don't remember) that did not have sufficient flash. Also, I am hesitant to make changes for a board I have not tested.So, making it a global change was not something I thought would be a good idea.

I wouldn't worry about not being able to test the individual boards in person. The thing to do is to make the change, submit the PR, see which ones fail, and fix just those boards.

These would be the general defaults, and then the mpconfigboard.mk files can override:
SAMD21: -Os
SAMx5x: -O2
nRF528xx: -O2. If all the nRF52833 boards don't fit (they have smaller flash), then add a further conditional to do this only for nRF52840.
STM32: These defaults may need to be per chip or chip subfamily.
litex, cxd56, and other specialized ports: pick the most reasonable option in the Makefile for that port.

A couple of other things:

  1. Instead of OPTIMIZATION_LEVEL, I would suggest OPTIMIZATION_FLAGS, and then set it to -O<level>. This makes it clearer what is being set, and allows more than one optimization-related flag to be set, in case someone would like to add extra -Ox or -fsomething flags that would be advantageous.

  2. Change the original -Os definition in all the port Makefiles to be dependent on OPTIMIZATION_FLAGS. So for instance, here's what's in atmel-samd/ports/Makefile now (with elisions for brevity):

ifeq ($(CHIP_FAMILY), samd21)
...
CFLAGS += -Os -DNDEBUG
...
endif

ifeq ($(CHIP_FAMILY), samd51)
...
CFLAGS += -Os -DNDEBUG
...
endif

ifeq ($(CHIP_FAMILY), same54)
...
CFLAGS += -Os -DNDEBUG
...
endif

I would change the above to something like the below. Note the use of ?=, which is less bulky than an ifndef:

ifeq ($(CHIP_FAMILY), samd21)
...
OPTIMIZATION_FLAGS ?= -Os
...
endif

ifeq ($(CHIP_FAMILY), samd51)
...
# boards without external flash may need -Os
OPTIMIZATION_FLAGS ?= -O2
...
endif

ifeq ($(CHIP_FAMILY), same54)
...
OPTIMIZATION_FLAGS ?= -O2
...
endif

CFLAGS += $(OPTIMIZATION_FLAGS) -DNDEBUG

After the above, all the SAMD51 boards will have -O2. Some will fail, and overriding OPTIMIZATION_FLAGS settings can then be added to their mpconfigboard.mk file.

@DavePutz
Copy link
Collaborator Author

@dhalbert - I see your point. I was just unsure about making changes for boards I have not tested. I did try using -O3 on a couple of boards and saw problems (like not booting properly). This made me cautious about increasing the optimization level globally. So, I will change the defaults in the Makefiles and then let them be overridden if needed by the boards mpconfigboard.mk.

Copy link
Collaborator

@dhalbert dhalbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! pewpew_m4 is failing; it needs -Os, I assume

One interesting thing which doesn't need to be done in this PR is to do the -Os only for particular language builds that don't fit.

@@ -14,3 +14,5 @@ EXTERNAL_FLASH_DEVICES = "GD25Q16C"
# We use a CFLAGS define here because there are include order issues
# if we try to include "mpconfigport.h" into nrfx_config.h .
CFLAGS += -DCIRCUITPY_NRF_NUM_I2C=2

OPTIMIZATION_LEVEL = 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is left-over from the first version of this PR and can be removed.

@@ -15,3 +15,5 @@ LD_COMMON = boards/common_default.ld
LD_DEFAULT = boards/STM32F405_default.ld
LD_BOOT = boards/STM32F405_boot.ld # UF2 boot option
UF2_OFFSET = 0x8010000

OPTIMIZATION_LEVEL = 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftover?

Copy link
Collaborator

@dhalbert dhalbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, pewpew_m4 is still failing.

Copy link
Collaborator

@dhalbert dhalbert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. I am surprised more boards didn't fail: more than I expected have room for `-O2', which is good.

@dhalbert
Copy link
Collaborator

The pre-commit check is stuck, but it was fine before. Never mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants