Compiling with Lottie Library enabled in LVGL configuration #8

Closed
tvanfossen opened this issue Dec 8, 2021 · 17 comments

Comments

@tvanfossen

Describe the bug

Compiling examples/rlottie fails.

To Reproduce

Compile latest LVGL on master with lottie library enabled in sdkconfig
lv_lib_rlottie added as a submodule under lvgl/lv_lib/lv_lib_rlottie

LVGL otherwise works as intended, but I haven't been able to get it to compile with the lottie option enabled in the sdkconfig. I'm very interested in trying this on the S3 specifically, to see how large a lottie animation can play on a 240x320 SPI TFT (the ESP32 was not performant enough to play animations larger than 120x120).

Expected behavior

Will not compile

rlottie_capi.h is a dependency from the full rlottie lib (also in lv_lib_rlottie)

Screenshots or video

image

@tvanfossen
Author

Wanted to note that lv_lib_rlottie will compile and run on an ESP32

lvgl/lv_lib_rlottie#3

@tvanfossen
Author

https://docs.lvgl.io/master/libs/rlottie.html

Will try following this

@tvanfossen
Author

ezgif-6-4ee95c73f50a
ezgif-2-8371cc15a483

Took some hacky fixes, but full screen (240x320) lottie file running on ESP32S3

@X-Ryl669

X-Ryl669 commented Dec 9, 2021

What is the important difference between the ESP32 and the ESP32-S3?

Is it just the LX7 vs. LX6 architecture, or is the LCD on your ESP32-S3 board using multiple SPI lines (DIO/QIO/8IO) so transfers are faster?
Or maybe you are using the DMA engine?

@tvanfossen
Author

Lottie files, per discussion in lv_lib_rlottie, consume a significant amount of memory in building the image buffer which is then used by rlottie to draw into (not discounting the amount of memory that then gets used by rlottie while drawing the buffer).

The ESP32-Wrover has 4MB of PSRAM - this is enough to be able to load a 240x320 lottie file, but is in no way performant (at or below 1 FPS).

The LCD shown above is effectively the same as the one in the early working lottie demos in lv_lib_rlottie: a single-channel SPI interface to the display. A parallel interface (such as the one used in the ESP-BOX) could potentially get the lottie files running even smoother, but at the cost of IO (whether that's affordable is project dependent). I'd be intrigued to compare performance against an actual ESP-BOX devkit with the parallel interface.

I expect the LX7 is fundamentally improved over the LX6, but the specific thing (in my belief) that the ESP32-S3 has over the ESP32, depending on config, is an octal SPI interface to PSRAM, which enables significantly faster access to the same block of PSRAM than on the WROVER. I believe this is one of the key enabling differences between the ESP32 and the S3 in getting a full-screen lottie animation to play.

The ESP32, without PSRAM, doesn't have enough available memory to allocate the full buffer for a 240x320 lottie animation: 240 x 320 x 32 / 8 = 307,200 bytes (~300 KB), while the DRAM segment is ~256 KB before any chunks start to get eaten by FreeRTOS.
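
The buffer arithmetic above can be checked with a few lines of C++. This is only a sketch: the helper names are made up for illustration, and the 256 KB DRAM figure is simply the rough number quoted above, not a measured value.

```cpp
#include <cstddef>

// Bytes needed for one full frame at the given bit depth.
constexpr std::size_t frame_buffer_bytes(std::size_t width,
                                         std::size_t height,
                                         std::size_t bits_per_pixel)
{
    return width * height * bits_per_pixel / 8;
}

// 240x320 at 32 bits per pixel (rlottie renders ARGB32).
constexpr std::size_t kFullScreenBytes = frame_buffer_bytes(240, 320, 32);
static_assert(kFullScreenBytes == 307200, "240x320 ARGB32 frame is 300 KiB");

// Assumption: the rough ~256 KB DRAM figure from the comment above,
// before FreeRTOS and everything else take their share.
constexpr std::size_t kAssumedDramBytes = 256 * 1024;

// True if a buffer of this size could fit in the assumed DRAM budget at all.
constexpr bool fits_in_dram(std::size_t bytes)
{
    return bytes <= kAssumedDramBytes;
}
```

With these numbers, `fits_in_dram(kFullScreenBytes)` is false, while a 120x120 buffer (57,600 bytes) fits, which is consistent with the sizes that worked on the plain ESP32.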

Another possibility: rlottie uses a significant number of floating-point calculations (vector graphics, after all). The ESP32, from my reading, uses a VFPU; documentation for the S3 seemed to point to this being improved over the ESP32, but I don't have hard numbers to back that up.

@X-Ryl669

X-Ryl669 commented Mar 3, 2022

For reference, here's a benchmark of ESP32s3 vs ESP32 on highly optimized FP32 code:
https://docs.espressif.com/projects/esp-dsp/en/latest/esp-dsp-benchmarks.html

For the PSRAM, if I follow what Espressif is saying, it shouldn't matter, since the PSRAM loads 32 kB into the CPU cache and works from there. But since the rlottie code uses a lot of classes containing pointers in their members, it might cross the 32 kB boundary and cause the PSRAM to swap. Hard to know without measuring...

I'll try to use the IRAM area for the rendering buffer since, at least for this one, it is allocated from LVGL, accesses are 32-bit aligned, and we usually have plenty left. On my ESP32 system, I couldn't get more than 12 fps on such an animation, and I can't transfer the whole screen to the SPI LCD that fast. Even scrolling is slow on my system.

@tvanfossen
Author

The link above is accounting for -O2 optimization. I have been unable to get rlottie to compile outside of debug compiler settings. Is this something that you fixed in your fork of rlottie?

I have not measured if it goes over 32kb, but would not surprise me on larger animations especially (maybe this is where smaller animations were performing better than larger ones?)

image
This is a small change I added to lv_rlottie to move the lottie buffer into SPIRAM, rather than use lv_mem_alloc.
image
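
For readers without the screenshots, a change of this kind (rendering buffer placed in SPIRAM instead of going through lv_mem_alloc) might look roughly like the sketch below. It is not the actual patch: `alloc_render_buffer` and `free_render_buffer` are hypothetical helper names, and the host fallback to `malloc` exists only so the snippet builds outside ESP-IDF. `heap_caps_malloc`, `heap_caps_free`, `MALLOC_CAP_SPIRAM` and `MALLOC_CAP_8BIT` are the ESP-IDF heap capability API.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#endif

// Hypothetical helper: allocate an ARGB32 render buffer (4 bytes per pixel).
// On ESP-IDF the memory is requested from external PSRAM; elsewhere it falls
// back to plain malloc so the sketch can be compiled and run on a host.
inline uint32_t *alloc_render_buffer(std::size_t width, std::size_t height)
{
    const std::size_t bytes = width * height * sizeof(uint32_t);
#ifdef ESP_PLATFORM
    return static_cast<uint32_t *>(
        heap_caps_malloc(bytes, MALLOC_CAP_SPIRAM | MALLOC_CAP_8BIT));
#else
    return static_cast<uint32_t *>(std::malloc(bytes));
#endif
}

// Hypothetical helper: release a buffer obtained from alloc_render_buffer.
inline void free_render_buffer(uint32_t *buf)
{
#ifdef ESP_PLATFORM
    heap_caps_free(buf);
#else
    std::free(buf);
#endif
}
```

The caller still has to check for a null return, since the PSRAM heap can be exhausted or disabled in sdkconfig.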

I assume the test on ESP32 with IRAM was using your partial rendering PR? I'd be curious to try this on an S3

FWIW, I just got an S3 with 2 MB PSRAM and quad SPI (vs. 8 MB PSRAM and octal SPI) yesterday. The octal SPI part is performant and plays 240x320 animations at 28-33 fps; the quad SPI part drops to around 15-20 fps (same code on both, aside from the sdkconfig change for quad vs. octal SPI). I haven't taken hard numbers, but I think this starts to support the improvement that octal vs. quad vs. single SPI to PSRAM can bring.

I'd also note that both the octal and quad chips connect to a LAN and can hit HTTP endpoints for those devices, something I was never able to get running alongside lottie on the ESP32 due to memory issues.

@X-Ryl669

I've run a simple test on an ESP32 (first version) with different rendering buffers (SPIRAM, IRAM, or DRAM), and here are the results:

Bench psram
---------------------
Value =   1359523, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    268212, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls 
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     43605, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     22912, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =    753508, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     76206, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =   1342710, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles
Bench iram
---------------------
Value =   1128851, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    268098, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls 
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     43586, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     22902, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =    620283, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     76172, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =   1128036, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles
Bench dram
---------------------
Value =    939930, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    268008, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls 
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     43560, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     22890, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =    563117, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     76136, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =    939577, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles

To sum up, rendering into DRAM is 30% faster than into SPIRAM, and IRAM is 15% faster than SPIRAM (I guess because of the overhead of some 8-bit accesses to the 32-bit IRAM, which trap the CPU and must be emulated).
This doesn't explain the difference you are observing.
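
As a cross-check, the percentages can be recomputed from the first "Counts cycles" value of each run in the log above. This is a small sketch; the helper name `percent_fewer_cycles` is made up for illustration, and "faster" is interpreted here as "fewer cycles relative to the SPIRAM run", which is an assumption about how the figures were derived.

```cpp
#include <cstdint>

// Percentage of cycles saved by `candidate` relative to `baseline`.
constexpr double percent_fewer_cycles(uint64_t baseline, uint64_t candidate)
{
    return 100.0 * (1.0 - static_cast<double>(candidate) /
                              static_cast<double>(baseline));
}

// First "Counts cycles" value of each benchmark run in the log above.
constexpr uint64_t kPsramCycles = 1359523;
constexpr uint64_t kIramCycles  = 1128851;
constexpr uint64_t kDramCycles  = 939930;

// Savings relative to rendering into SPIRAM.
constexpr double kDramSaving = percent_fewer_cycles(kPsramCycles, kDramCycles);
constexpr double kIramSaving = percent_fewer_cycles(kPsramCycles, kIramCycles);
```

Computed this way, the DRAM saving lands just above 30% and the IRAM saving in the mid-to-high teens, in the same range as the rounded figures quoted.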

@X-Ryl669

The difference reduces to 25% when accounting for color conversion (DRAM is only 25% faster than SPIRAM). Couldn't test IRAM here, since color conversion code is doing unaligned access that's not supported by IRAM.

The next step would be to have rlottie allocate all its structures in DRAM to test (a lot harder since it implies modifying the C++ allocator system wide not to use SPIRAM...).
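
To illustrate the mechanism only (not how rlottie is actually structured, and not the system-wide change described above), a standard-library-compatible allocator that pins allocations to internal RAM could be sketched like this. `InternalRamAllocator` is a hypothetical name; `heap_caps_malloc`, `heap_caps_free`, `MALLOC_CAP_INTERNAL` and `MALLOC_CAP_8BIT` are ESP-IDF heap capability calls, and the `malloc` fallback is there only so the snippet builds on a host.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

#ifdef ESP_PLATFORM
#include "esp_heap_caps.h"
#endif

// Hypothetical allocator that keeps container storage out of SPIRAM.
template <typename T>
struct InternalRamAllocator {
    using value_type = T;

    InternalRamAllocator() noexcept = default;
    template <typename U>
    InternalRamAllocator(const InternalRamAllocator<U> &) noexcept {}

    T *allocate(std::size_t n)
    {
#ifdef ESP_PLATFORM
        void *p = heap_caps_malloc(n * sizeof(T),
                                   MALLOC_CAP_INTERNAL | MALLOC_CAP_8BIT);
#else
        void *p = std::malloc(n * sizeof(T));
#endif
        // Abort rather than throw, since embedded builds often disable exceptions.
        if (p == nullptr) std::abort();
        return static_cast<T *>(p);
    }

    void deallocate(T *p, std::size_t) noexcept
    {
#ifdef ESP_PLATFORM
        heap_caps_free(p);
#else
        std::free(p);
#endif
    }
};

// All instances are interchangeable: memory from one can be freed by another.
template <typename T, typename U>
bool operator==(const InternalRamAllocator<T> &, const InternalRamAllocator<U> &) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const InternalRamAllocator<T> &, const InternalRamAllocator<U> &) noexcept { return false; }
```

A container would opt in with something like `std::vector<int, InternalRamAllocator<int>>`. Applying this to rlottie would mean touching every container and `new` expression it uses, which is why the comment above calls the experiment a lot harder.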

@X-Ryl669

X-Ryl669 commented Mar 12, 2022

Ok, here's the last test, and it matches what you observed.

Heap summary for capabilities 0x00000400:
  At 0x3f800000 len 4194303 free 4192095 allocated 20 min_free 4192095
    largest_free_block 4128768 alloc_blocks 1 free_blocks 1 total_blocks 2
  Totals:
    free 4192095 allocated 20 min_free 4192095 largest_free_block 4128768
Heap summary for capabilities 0x00000802:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 1328 allocated 4668 min_free 152
    largest_free_block 1024 alloc_blocks 51 free_blocks 1 total_blocks 52
  At 0x3ffb67b8 len 170056 free 83568 allocated 85528 min_free 83568
    largest_free_block 81920 alloc_blocks 43 free_blocks 1 total_blocks 44
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x40091818 len 59368 free 58544 allocated 0 min_free 58516
    largest_free_block 57344 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 303035 allocated 90196 min_free 301831 largest_free_block 110592
Heap summary for capabilities 0x00000804:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 1328 allocated 4668 min_free 152
    largest_free_block 1024 alloc_blocks 51 free_blocks 1 total_blocks 52
  At 0x3ffb67b8 len 170056 free 83568 allocated 85528 min_free 83568
    largest_free_block 81920 alloc_blocks 43 free_blocks 1 total_blocks 44
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 244491 allocated 90196 min_free 243315 largest_free_block 110592
Opened animation 61 frames 10
Bench psram
---------------------
Value =   4091424, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    380758, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls 
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     48024, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     39438, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =   2104364, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     84908, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =   4032131, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles
Heap summary for capabilities 0x00000400:
  At 0x3f800000 len 4194303 free 4118399 allocated 73716 min_free 4118399
    largest_free_block 4063232 alloc_blocks 235 free_blocks 3 total_blocks 238
  Totals:
    free 4118399 allocated 73716 min_free 4118399 largest_free_block 4063232
Heap summary for capabilities 0x00000802:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 304 allocated 5692 min_free 152
    largest_free_block 0 alloc_blocks 75 free_blocks 0 total_blocks 75
  At 0x3ffb67b8 len 170056 free 82836 allocated 86260 min_free 82836
    largest_free_block 81920 alloc_blocks 56 free_blocks 1 total_blocks 57
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x40091818 len 59368 free 58544 allocated 0 min_free 58516
    largest_free_block 57344 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 301279 allocated 91952 min_free 301099 largest_free_block 110592
Heap summary for capabilities 0x00000804:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 304 allocated 5692 min_free 152
    largest_free_block 0 alloc_blocks 75 free_blocks 0 total_blocks 75
  At 0x3ffb67b8 len 170056 free 82836 allocated 86260 min_free 82836
    largest_free_block 81920 alloc_blocks 56 free_blocks 1 total_blocks 57
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 242735 allocated 91952 min_free 242583 largest_free_block 110592
Opened animation 61 frames 10
Bench iram
---------------------
Value =   1146097, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    268094, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls 
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     43576, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     22900, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =    632061, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     76167, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =   1120922, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles
Heap summary for capabilities 0x00000400:
  At 0x3f800000 len 4194303 free 4192063 allocated 52 min_free 4118399
    largest_free_block 4128768 alloc_blocks 3 free_blocks 2 total_blocks 5
  Totals:
    free 4192063 allocated 52 min_free 4118399 largest_free_block 4128768
Heap summary for capabilities 0x00000802:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 300 allocated 5696 min_free 152
    largest_free_block 0 alloc_blocks 74 free_blocks 0 total_blocks 74
  At 0x3ffb67b8 len 170056 free 25544 allocated 143552 min_free 25544
    largest_free_block 23552 alloc_blocks 288 free_blocks 3 total_blocks 291
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x40091818 len 59368 free 42160 allocated 16384 min_free 42160
    largest_free_block 40960 alloc_blocks 1 free_blocks 1 total_blocks 2
  Totals:
    free 227599 allocated 165632 min_free 227451 largest_free_block 110592
Heap summary for capabilities 0x00000804:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 300 allocated 5696 min_free 152
    largest_free_block 0 alloc_blocks 74 free_blocks 0 total_blocks 74
  At 0x3ffb67b8 len 170056 free 25544 allocated 143552 min_free 25544
    largest_free_block 23552 alloc_blocks 288 free_blocks 3 total_blocks 291
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 185439 allocated 149248 min_free 185291 largest_free_block 110592
Opened animation 61 frames 10
Bench dram
---------------------
Value =   1262170, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    379254, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls 
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     47692, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     39289, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =    723814, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     84387, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =   1213166, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles
Heap summary for capabilities 0x00000400:
  At 0x3f800000 len 4194303 free 4192063 allocated 52 min_free 4118399
    largest_free_block 4128768 alloc_blocks 3 free_blocks 2 total_blocks 5
  Totals:
    free 4192063 allocated 52 min_free 4118399 largest_free_block 4128768
Heap summary for capabilities 0x00000802:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 300 allocated 5696 min_free 152
    largest_free_block 0 alloc_blocks 74 free_blocks 0 total_blocks 74
  At 0x3ffb67b8 len 170056 free 9156 allocated 159940 min_free 9156
    largest_free_block 7680 alloc_blocks 289 free_blocks 2 total_blocks 291
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x40091818 len 59368 free 58544 allocated 0 min_free 42160
    largest_free_block 57344 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 227595 allocated 165636 min_free 211063 largest_free_block 110592
Heap summary for capabilities 0x00000804:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 300 allocated 5696 min_free 152
    largest_free_block 0 alloc_blocks 74 free_blocks 0 total_blocks 74
  At 0x3ffb67b8 len 170056 free 9156 allocated 159940 min_free 9156
    largest_free_block 7680 alloc_blocks 289 free_blocks 2 total_blocks 291
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 169051 allocated 165636 min_free 168903 largest_free_block 110592

In short, if the rlottie structures (the result of parsing) are in SPIRAM, then rendering is 4x slower. Please ignore the IRAM advantage here, since there's no color conversion in this test for IRAM.
So clearly, the output rendering buffer can be in SPIRAM with no big impact on performance (only 25% slower), since it's pipelined IIUC.
However, it's super important that the parser allocates in DRAM (in that case, rendering is 3.3x faster).

@X-Ryl669

X-Ryl669 commented Mar 12, 2022

For information, parsing the file takes 26333964 cycles (so 21.7x longer than rendering) in DRAM and 33590308 cycles (so 8.20x longer than rendering) in SPIRAM.
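
The ratios in this thread can be reproduced from the cycle counts in the logs. The sketch below is a cross-check only; which "Counts cycles" line each quoted ratio was based on is my assumption, chosen because those values reproduce the quoted figures.

```cpp
#include <cstdint>

// Simple ratio of two cycle counts.
constexpr double ratio(uint64_t numerator, uint64_t denominator)
{
    return static_cast<double>(numerator) / static_cast<double>(denominator);
}

// Parsing cycle counts quoted above.
constexpr uint64_t kParseDram   = 26333964;
constexpr uint64_t kParseSpiram = 33590308;

// Rendering cycle counts from the last test's log.
// Assumption: these are the lines the quoted ratios were derived from.
constexpr uint64_t kRenderDramLast    = 1213166; // final "Counts cycles" of Bench dram
constexpr uint64_t kRenderSpiramFirst = 4091424; // first "Counts cycles" of Bench psram
constexpr uint64_t kRenderSpiramLast  = 4032131; // final "Counts cycles" of Bench psram

constexpr double kParseVsRenderDram   = ratio(kParseDram, kRenderDramLast);
constexpr double kParseVsRenderSpiram = ratio(kParseSpiram, kRenderSpiramFirst);
constexpr double kRenderSpiramVsDram  = ratio(kRenderSpiramLast, kRenderDramLast);
```

With these inputs the parse-to-render ratios come out at about 21.7 (DRAM) and 8.2 (SPIRAM), and the SPIRAM-to-DRAM rendering ratio at about 3.3.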

@tvanfossen
Author

tvanfossen commented Mar 14, 2022

This is fantastic info!!! Thanks for digging into this more; it's way more precise than what I was doing.

To your point about modifying rlottie's allocator so it can target DRAM/IRAM/SPIRAM/wherever else it could go: at what point do you think it's worthwhile to have an official lottie plugin for LVGL, rather than relying on rlottie and any nuance that comes with it? Presumably the main value would be on the embedded side, which is LVGL's niche to begin with. Is something like a ThorVG port into LVGL a better route for rendering lottie than rlottie? Is lottie the ideal format for rendering pre-built animations from a file, as opposed to Rive, GIFs, etc.?

Re-invent the wheel vs square peg/round hole

@X-Ryl669

Here's the code:

// We need types 
#include "Logger.hpp"
// We need performance counter
#include <perfmon.h>

static const uint32_t counters[] = {
        XTPERF_CNT_CYCLES, XTPERF_MASK_CYCLES, // total cycles
        XTPERF_CNT_INSN, XTPERF_MASK_INSN_ALL, // total instructions
        XTPERF_CNT_D_LOAD_U1, XTPERF_MASK_D_LOAD_LOCAL_MEM, // Mem read
        XTPERF_CNT_D_STORE_U1, XTPERF_MASK_D_STORE_LOCAL_MEM, // Mem write
        XTPERF_CNT_BUBBLES, XTPERF_MASK_BUBBLES_ALL &(~XTPERF_MASK_BUBBLES_R_HOLD_REG_DEP),  // wait for other reasons
        XTPERF_CNT_BUBBLES, XTPERF_MASK_BUBBLES_R_HOLD_REG_DEP,           // Wait for register dependency
        XTPERF_CNT_OVERFLOW, XTPERF_MASK_OVERFLOW,               // Last test cycle
    };


struct PerfCounter
{
    static void dump(void *name, uint32_t select, uint32_t mask, uint32_t value)
    {
        Log(Logger::SystemInfo, "Bench (%s) s%u m%u v:%u", (const char*)name, select, mask, value);
    }

    static void benchmark(const char * name, void (*func)(void *), void * arg) {        
        xtensa_perfmon_config_t config;
        Zero(config);

        config.counters_size = ArrSz(counters);
        config.select_mask = counters;
        config.repeat_count = 200;
        config.max_deviation = 1;
        config.call_function = func;
        config.call_params = arg;
        config.callback = xtensa_perfmon_view_cb;
        config.callback_params = stdout;
        config.tracelevel = -1;

        Log(Logger::SystemInfo, "Bench %s\n---------------------", name);
        xtensa_perfmon_exec(&config);
    }
};

#include <rlottie_capi.h>

#define renderSize 64

struct Lottie
{
    Lottie_Animation * animation;
    uint32_t *         renderingBuffer;

    Lottie(const void * data, size_t len) : renderingBuffer(0) {
        animation = lottie_animation_from_rodata((const char *)data, len, "");
        if (!animation) {
            Log(Logger::SystemError, "Can't open lottie");
        }
       // double fps = lottie_animation_get_framerate(animation);
       // size_t duration = lottie_animation_get_totalframe(animation);
       // Log(Logger::SystemInfo, "Opened animation %u frames %g", duration, fps);
    }

    static void convert_to_rgba5658(uint32_t * pix, uint8_t * dest, const size_t width, const size_t height)
    {
        /* rlottie draws in ARGB32 format, but LVGL only deals with the RGB565 format (with an optional 8-bit alpha channel),
        so convert the received buffer in place to the LVGL format. */
        uint32_t * src = pix;
        for(size_t y = 0; y < height; y++) {
            /* Convert a 4 bytes per pixel in format ARGB to R5G6B5A8 format
                naive way:
                            r = ((c & 0xFF0000) >> 19)
                            g = ((c & 0xFF00) >> 10)
                            b = ((c & 0xFF) >> 3)
                            rgb565 = (r << 11) | (g << 5) | b
                            a = c >> 24;
                That's 3 mask, 6 bitshift and 2 or operations

                A bit better:
                            r = ((c & 0xF80000) >> 8)
                            g = ((c & 0xFC00) >> 5)
                            b = ((c & 0xFF) >> 3)
                            rgb565 = r | g | b
                            a = c >> 24;
                That's 3 mask, 3 bitshifts and 2 or operations */
            for(size_t x = 0; x < width; x++) {
                uint32_t in = src[x];
    #if LV_COLOR_16_SWAP == 0
                uint16_t r = (uint16_t)(((in & 0xF80000) >> 8) | ((in & 0xFC00) >> 5) | ((in & 0xFF) >> 3));
    #else
                /* We want: rrrr rrrr GGGg gggg bbbb bbbb => gggb bbbb rrrr rGGG */
                uint16_t r = (uint16_t)(((in & 0xF80000) >> 16) | ((in & 0xFC00) >> 13) | ((in & 0x1C00) << 3) | ((in & 0xF8) << 5));
    #endif

                memcpy(dest, &r, sizeof(r));
                dest[sizeof(r)] = (uint8_t)(in >> 24);
                dest += 3;
            }
            src += width;
        }
    }

    void render(bool allowUnaligned = true) {
        if (!renderingBuffer) return;
        lottie_animation_render(animation, 12, renderingBuffer, renderSize, renderSize, renderSize * 4);
        if (allowUnaligned) convert_to_rgba5658(renderingBuffer, (uint8_t*)renderingBuffer, renderSize, renderSize);
    }

    ~Lottie() {
        free0(renderingBuffer);
        lottie_animation_destroy(animation);
    }
};

void psramMode(void * obj) {
    Lottie * lottie = (Lottie*)obj;
    lottie->render();
}

void iramMode(void * obj) {
    Lottie * lottie = (Lottie*)obj;
    lottie->render(false);
}

void parserTime(void * obj) {
    ROString * txt = (ROString *)obj;
    Lottie lottie(txt->getData(), txt->getLength());
    (void)lottie;
}

void dumpHeap() {
    heap_caps_print_heap_info(MALLOC_CAP_SPIRAM);
    heap_caps_print_heap_info(MALLOC_CAP_32BIT | MALLOC_CAP_INTERNAL);
    heap_caps_print_heap_info(MALLOC_CAP_8BIT | MALLOC_CAP_INTERNAL);
}

enum TestMode
{
    SPIRAM = 0,
    IRAM = 1,
    DRAM = 2,
};
static TestMode  _testMode;
using namespace std;
void * _new(size_t size)
{
    switch(_testMode) {
    case SPIRAM: return heap_caps_aligned_alloc(4, size, MALLOC_CAP_SPIRAM);
    case IRAM:   return heap_caps_aligned_alloc(4, size, MALLOC_CAP_32BIT | MALLOC_CAP_INTERNAL);
    case DRAM:   return heap_caps_aligned_alloc(4, size, MALLOC_CAP_8BIT | MALLOC_CAP_INTERNAL);  
    default: return malloc(size);
    }
}
void * operator new(size_t size)
{
    return _new(size);
}
void * operator new[](size_t size)
{
    return _new(size);
}

void operator delete(void * p)
{
    free(p);
}
void operator delete[](void * p)
{
    free(p);
}

void benchmarkRlottie(void*)
{
    ROString iconData = efs.getFile("faucet.json");
/*    {
        _testMode = SPIRAM;
        PerfCounter::benchmark("parser", &parserTime, &iconData);
    }
*/
    dumpHeap();
    {
        _testMode = SPIRAM;
        Lottie lottie(iconData.getData(), iconData.getLength());
        lottie.renderingBuffer = (uint32_t*)heap_caps_aligned_alloc(4, renderSize*4*renderSize, MALLOC_CAP_SPIRAM);
        if (!lottie.renderingBuffer) {
            Log(Logger::SystemError, "Can't allocate SPIRAM");
            return;
        }
        memset(lottie.renderingBuffer, 0, renderSize*renderSize*4);
        PerfCounter::benchmark("psram", &psramMode, &lottie);
        dumpHeap();
    }
//    free0(lottie.renderingBuffer);

    {
        _testMode = DRAM;
        Lottie lottie(iconData.getData(), iconData.getLength());
        lottie.renderingBuffer = (uint32_t*)heap_caps_aligned_alloc(4, renderSize*4*renderSize, MALLOC_CAP_32BIT | MALLOC_CAP_INTERNAL);
        if (!lottie.renderingBuffer) {
            Log(Logger::SystemError, "Can't allocate IRAM");
            return;
        }
        memset(lottie.renderingBuffer, 0, renderSize*renderSize*4);
        PerfCounter::benchmark("iram", &iramMode, &lottie);
        dumpHeap();
    }

    // lottie.renderingBuffer = (uint32_t*)heap_caps_aligned_alloc(4, renderSize*4*renderSize, MALLOC_CAP_32BIT);
    // memset(lottie.renderingBuffer, 0, renderSize*renderSize*4);
    // if (!lottie.renderingBuffer) {
    //     Log(Logger::SystemError, "Can't allocate IRAM");
    //     return;
    // }
    // PerfCounter::benchmark("iram", &iramMode, &lottie);

    // free0(lottie.renderingBuffer);
    {
        _testMode = DRAM;
        Lottie lottie(iconData.getData(), iconData.getLength());
        lottie.renderingBuffer = (uint32_t*)heap_caps_aligned_alloc(4, renderSize*4*renderSize, MALLOC_CAP_8BIT | MALLOC_CAP_INTERNAL);
        if (!lottie.renderingBuffer) {
            Log(Logger::SystemError, "Can't allocate DRAM");
            return;
        }
        memset(lottie.renderingBuffer, 0, renderSize*renderSize*4);
        PerfCounter::benchmark("dram", &psramMode, &lottie);
        dumpHeap();
    }

    // lottie.renderingBuffer = (uint32_t*)heap_caps_aligned_alloc(4, renderSize*4*renderSize, MALLOC_CAP_8BIT | MALLOC_CAP_INTERNAL);
    // memset(lottie.renderingBuffer, 0, renderSize*renderSize*4);
    // if (!lottie.renderingBuffer) {
    //     Log(Logger::SystemError, "Can't allocate IRAM");
    //     return;
    // }
    // PerfCounter::benchmark("dram", &psramMode, &lottie);


    while(1) { delayMs(1000); }
}
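
The non-swapped branch of convert_to_rgba5658 above can be checked on a desktop machine with a stand-alone sketch like the following. The helper names argbToRgb565 and convertPixels are hypothetical (they are not part of rlottie or LVGL), and the code assumes LV_COLOR_16_SWAP == 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: pack one ARGB8888 pixel into RGB565, using the same
// masks and shifts as the LV_COLOR_16_SWAP == 0 branch above.
static inline uint16_t argbToRgb565(uint32_t in)
{
    return (uint16_t)(((in & 0xF80000) >> 8)   // top 5 bits of red   -> bits 15..11
                    | ((in & 0x00FC00) >> 5)   // top 6 bits of green -> bits 10..5
                    | ((in & 0x0000FF) >> 3)); // top 5 bits of blue  -> bits 4..0
}

// Hypothetical helper: convert `count` ARGB8888 pixels to 3 bytes per pixel
// (RGB565 followed by 8-bit alpha). dest may point to the same memory as src:
// pixel i is read from bytes [4i, 4i+4) before bytes [3i, 3i+3) are written,
// so a pixel that has not been read yet is never overwritten.
static void convertPixels(const uint32_t * src, uint8_t * dest, size_t count)
{
    assert(src != nullptr && dest != nullptr);
    for (size_t i = 0; i < count; i++) {
        const uint32_t in = src[i];
        const uint16_t rgb = argbToRgb565(in);
        memcpy(dest, &rgb, sizeof(rgb));   // memcpy avoids an unaligned 16-bit store
        dest[2] = (uint8_t)(in >> 24);     // alpha channel
        dest += 3;
    }
}
```

Pure red, green and blue inputs should map to 0xF800, 0x07E0 and 0x001F respectively, which is a quick way to validate the masks before flashing.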

@X-Ryl669

X-Ryl669 commented Mar 14, 2022

And for the other questions:

I'm working on improving rlottie so it consumes less memory. Right now, I'm changing the code to use another vector class that doesn't grow exponentially when it allocates (so when allocating 17 elements, you don't pay for 32 elements as you do with the standard STL). This is not a huge amount of work and it should hit my memshrink branch tomorrow. It'll be slower than the classical std::vector unless the code is modified to pre-reserve the expected size (I'm also looking at this, but the rlottie code is a bit convoluted). I'm sure the overhead will be worth it, since the benchmark above shows that a 12kB JSON file consumes 74kB of heap memory for internal rlottie structures once parsed. The less memory it uses, the more compact it'll be in SPIRAM and the better it'll fit the 32KB cache, so it might be faster in the end (also, parsing is done once, but rendering is done for each frame).
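
To make the "17 elements cost 32" remark concrete, here is a minimal sketch comparing the capacity of a std::vector filled with push_back against one that reserves its final size first. The function names are hypothetical. The growth factor is implementation-defined, so the exact capacity without reserve depends on the standard library (a doubling implementation typically ends up at 32 for 17 elements).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Capacity reached when elements are appended one at a time (no reserve).
static size_t capacityWithPushBack(size_t count)
{
    std::vector<int> v;
    for (size_t i = 0; i < count; i++) v.push_back((int)i);
    assert(v.size() == count);
    return v.capacity();
}

// Capacity reached when the final size is reserved up front.
static size_t capacityWithReserve(size_t count)
{
    std::vector<int> v;
    v.reserve(count);   // one allocation, sized for the expected element count
    for (size_t i = 0; i < count; i++) v.push_back((int)i);
    assert(v.size() == count);
    return v.capacity();
}
```

Comparing capacityWithPushBack(17) with capacityWithReserve(17) shows how many slots are allocated but unused in each case on a given toolchain.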

Then there's the possibility of changing the allocator. Unfortunately, in C++, if you override operator new, it's global for the whole program (that's one of the historical dumb spots of C++), or you need to overload it per class (which is impossible to do for rlottie, too much work).

Yet, I think I've found a solution that might help more people. My idea is to implement an operator new override that uses a thread-local storage variable telling where to allocate. So the code would be something like this:

render_rlottie() {

   set_custom_allocator(DRAM);
   lottie = lottie_animation_from_file(...);   // This calls new in all structure and that'll be overridden above
   set_custom_allocator(Default); // Back to default allocator
}

// Allocator
thread_local mode; // Either DRAM, IRAM, SPIRAM, whatever
void * operator new(size_t n) {
    switch (mode) {
      case Default: return ::malloc(n);
      case DRAM: return ::heap_caps_malloc(n, MALLOC_CAP_8BIT | MALLOC_CAP_INTERNAL);
...
}
[...]
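
The idea above can be tried on a desktop machine with a sketch like the one below. Everything in it is hypothetical naming (HeapMode, ScopedHeap, routedAlloc), and all modes fall back to malloc() so that it runs off-target. On ESP-IDF, each mode would instead select a capability mask for heap_caps_malloc(), and a replaced operator new would forward to routedAlloc(). The RAII guard is a small addition to the pseudocode: it restores the previous mode even on an early return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical heap selector; on ESP-IDF each value would map to a
// heap_caps_malloc() capability mask instead of plain malloc().
enum class HeapMode { Default = 0, DRAM = 1, SPIRAM = 2 };

// One mode per thread, so a render task cannot change another task's heap.
static thread_local HeapMode g_heapMode = HeapMode::Default;

// Bytes requested per mode, only here to make the routing observable.
// Demo only: this counter is not thread safe.
static size_t g_requested[3] = { 0, 0, 0 };

// RAII guard: selects a heap for the current scope and restores the
// previous one when the scope ends.
class ScopedHeap {
public:
    explicit ScopedHeap(HeapMode mode) : previous(g_heapMode) { g_heapMode = mode; }
    ~ScopedHeap() { g_heapMode = previous; }
    ScopedHeap(const ScopedHeap &) = delete;
    ScopedHeap & operator=(const ScopedHeap &) = delete;
private:
    HeapMode previous;
};

// What a replaced operator new would forward to. Every mode uses malloc()
// in this host-side sketch; only the bookkeeping differs.
static void * routedAlloc(size_t n)
{
    g_requested[(size_t)g_heapMode] += n;
    return malloc(n);
}
```

A caller would wrap the parsing step in a block that declares `ScopedHeap guard(HeapMode::DRAM);`, so that allocations made inside the block are counted under DRAM and the mode returns to Default afterwards.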

For LVGL, I think it'll need more work to implement a capability-aware allocator (a smarter lv_malloc).
The idea would be to call lv_rlottie_create() with some intent telling whether it's going to be used once, whether it's rotated, resized and so on. With that intent, we could decide smartly which heap to use. It makes sense to use SPIRAM for lottie files used as vector images only (no animation), or when rotation is required (in that case it's rendered to a complete buffer anyway). But for animations where rendering needs to happen multiple times per second, we should prefer DRAM for the lottie internal structures, and SPIRAM for the rendering buffer (unless there's enough DRAM/IRAM for those too).
I'll probably rewrite the color conversion code (going from 32-bit ARGB to 24-bit RGB565A8) to process 3 pixels at once, so it only does 32-bit reads and writes to the initial buffer and avoids the unaligned-access processor exception on IRAM.
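
One way to get only 32-bit accesses is to handle pixels in groups of four, since 4 x 3 output bytes fill exactly three 32-bit words. That grouping is an assumption of this sketch, not necessarily the scheme planned above. The function names are hypothetical, and the byte layout assumes a little-endian CPU and LV_COLOR_16_SWAP == 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same masks and shifts as the non-swapped conversion, kept in a 32-bit
// value so it can be shifted into place without casts.
static inline uint32_t rgb565(uint32_t in)
{
    return ((in & 0xF80000) >> 8) | ((in & 0x00FC00) >> 5) | ((in & 0x0000FF) >> 3);
}

// Hypothetical helper: in-place ARGB8888 -> RGB565A8 conversion using only
// aligned 32-bit loads and stores. pixelCount must be a multiple of 4.
static void convertAligned(uint32_t * buf, size_t pixelCount)
{
    assert(pixelCount % 4 == 0);
    uint32_t * out = buf;
    for (size_t i = 0; i < pixelCount; i += 4) {
        // Read the whole group first: the stores below may overwrite it.
        const uint32_t p0 = buf[i], p1 = buf[i + 1], p2 = buf[i + 2], p3 = buf[i + 3];
        const uint32_t c0 = rgb565(p0), c1 = rgb565(p1), c2 = rgb565(p2), c3 = rgb565(p3);
        // Output byte stream: c0lo c0hi a0 c1lo | c1hi a1 c2lo c2hi | a2 c3lo c3hi a3
        out[0] = c0 | ((p0 >> 24) << 16) | ((c1 & 0xFF) << 24);
        out[1] = (c1 >> 8) | ((p1 >> 24) << 8) | (c2 << 16);
        out[2] = (p2 >> 24) | (c3 << 8) | ((p3 >> 24) << 24);
        out += 3;
    }
}
```

Because the write pointer advances three words for every four words read, the stores never reach a group that has not been loaded yet.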

In short, for light interfaces with 3 animated icons, I think we can work with the DRAM heap (that's around 180kB) and gain a good 3 to 4x in performance compared to using the SPIRAM.

A last possible optimization would be to change the rlottie code to use hashes instead of strings as keys for its few unordered_map instances, in order to avoid most of the useless heap allocations for std::string. Maybe we could gain another 1 to 2kB here. But it's not the low-hanging fruit anymore...
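
As an illustration of the hash-key idea, here is a minimal sketch that keys an unordered_map with a 32-bit FNV-1a hash computed at compile time. The choice of FNV-1a and the names fnv1a and lookupExample are assumptions made for this example, not something taken from rlottie. A real change would also have to deal with hash collisions.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// 32-bit FNV-1a over a NUL-terminated string. It is constexpr, so hashes of
// string literals can be computed by the compiler instead of at run time.
constexpr uint32_t fnv1a(const char * s, uint32_t h = 2166136261u)
{
    return *s ? fnv1a(s + 1, (h ^ (uint8_t)*s) * 16777619u) : h;
}

// Hypothetical lookup table keyed by hash: no std::string is constructed,
// so there is no per-key heap allocation for the key text.
static int lookupExample()
{
    std::unordered_map<uint32_t, int> layers;
    layers[fnv1a("background")] = 1;
    layers[fnv1a("drop")] = 2;
    const auto it = layers.find(fnv1a("drop"));
    return it != layers.end() ? it->second : -1;
}
```

The map still allocates its nodes on the heap; what disappears is the separate buffer each std::string key may own.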

I don't think ThorVG will be the panacea either. It has many advantages but a lot of cons for me:

  1. There is too little software to create rive files and none of it is open source (unlike the lottie format, which is usable with Glaxnimate, Synfig, Haiku, ...)
  2. The code doesn't fit true embedded platforms very well
  3. SVG is larger than the lottie format (XML vs JSON), so for simple vector icons stored in flash, lottie is better (and with my branch, it's not required to load the JSON file in the heap, it can be parsed from flash directly). Their binary mode (.thorvg format) is just about the same size as an uncompressed lottie file.
  4. Lottie support in ThorVG isn't there. I doubt it'll happen soon
  5. Rive support is in another repository that uses enlightenment's primitives, so it needs to be disentangled to be used in LVGL.
  6. There's no cmake support, so it has to be written from scratch.

@X-Ryl669

X-Ryl669 commented Mar 16, 2022

I've tried with a vector class that's not using exponential growth and got these results:

Heap summary for capabilities 0x00000400:
  At 0x3f800000 len 4194303 free 4192095 allocated 20 min_free 4192095
    largest_free_block 4128768 alloc_blocks 1 free_blocks 1 total_blocks 2
  Totals:
    free 4192095 allocated 20 min_free 4192095 largest_free_block 4128768
Heap summary for capabilities 0x00000802:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 1328 allocated 4668 min_free 152
    largest_free_block 1024 alloc_blocks 51 free_blocks 1 total_blocks 52
  At 0x3ffb67b8 len 170056 free 83568 allocated 85528 min_free 83568
    largest_free_block 81920 alloc_blocks 43 free_blocks 1 total_blocks 44
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x40091818 len 59368 free 58544 allocated 0 min_free 58516
    largest_free_block 57344 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 303035 allocated 90196 min_free 301831 largest_free_block 110592
Heap summary for capabilities 0x00000804:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 1328 allocated 4668 min_free 152
    largest_free_block 1024 alloc_blocks 51 free_blocks 1 total_blocks 52
  At 0x3ffb67b8 len 170056 free 83568 allocated 85528 min_free 83568
    largest_free_block 81920 alloc_blocks 43 free_blocks 1 total_blocks 44
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 244491 allocated 90196 min_free 243315 largest_free_block 110592
Bench psram
---------------------
Value =   4399859, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    392966, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     50446, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     42212, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =   2634778, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     91183, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =   6600012, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles
Heap summary for capabilities 0x00000400:
  At 0x3f800000 len 4194303 free 4107583 allocated 84532 min_free 4096387
    largest_free_block 4063232 alloc_blocks 235 free_blocks 4 total_blocks 239
  Totals:
    free 4107583 allocated 84532 min_free 4096387 largest_free_block 4063232
Heap summary for capabilities 0x00000802:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 304 allocated 5692 min_free 152
    largest_free_block 0 alloc_blocks 75 free_blocks 0 total_blocks 75
  At 0x3ffb67b8 len 170056 free 82836 allocated 86260 min_free 82836
    largest_free_block 81920 alloc_blocks 56 free_blocks 1 total_blocks 57
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x40091818 len 59368 free 58544 allocated 0 min_free 58516
    largest_free_block 57344 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 301279 allocated 91952 min_free 301099 largest_free_block 110592
Heap summary for capabilities 0x00000804:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 32011
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 304 allocated 5692 min_free 152
    largest_free_block 0 alloc_blocks 75 free_blocks 0 total_blocks 75
  At 0x3ffb67b8 len 170056 free 82836 allocated 86260 min_free 82836
    largest_free_block 81920 alloc_blocks 56 free_blocks 1 total_blocks 57
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 242735 allocated 91952 min_free 242583 largest_free_block 110592
Bench iram
---------------------
Value =   1203417, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    281808, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     46265, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     25700, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =    710148, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     82865, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =   1393218, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles
Heap summary for capabilities 0x00000400:
  At 0x3f800000 len 4194303 free 4192063 allocated 52 min_free 4096387
    largest_free_block 4128768 alloc_blocks 3 free_blocks 2 total_blocks 5
  Totals:
    free 4192063 allocated 52 min_free 4096387 largest_free_block 4128768
Heap summary for capabilities 0x00000802:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 20815
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 300 allocated 5696 min_free 152
    largest_free_block 0 alloc_blocks 74 free_blocks 0 total_blocks 74
  At 0x3ffb67b8 len 170056 free 14740 allocated 154356 min_free 3560
    largest_free_block 13312 alloc_blocks 288 free_blocks 3 total_blocks 291
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x40091818 len 59368 free 42160 allocated 16384 min_free 42160
    largest_free_block 40960 alloc_blocks 1 free_blocks 1 total_blocks 2
  Totals:
    free 216795 allocated 176436 min_free 194271 largest_free_block 110592
Heap summary for capabilities 0x00000804:
  At 0x3ffbb744 len 32767 free 32011 allocated 0 min_free 20815
    largest_free_block 31744 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffae6e0 len 6432 free 300 allocated 5696 min_free 152
    largest_free_block 0 alloc_blocks 74 free_blocks 0 total_blocks 74
  At 0x3ffb67b8 len 170056 free 14740 allocated 154356 min_free 3560
    largest_free_block 13312 alloc_blocks 288 free_blocks 3 total_blocks 291
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 14636
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 174635 allocated 160052 min_free 152111 largest_free_block 110592
Bench dram
---------------------
Value =   1319464, select =  0, mask = 0001.  Counts cycles.
                  Amount of cycles
Value =    392995, select =  2, mask = 8dff.  Successfully Retired Instructions.
                  JX instructions
                  CALLXn instructions
                  return instructions (RET, RETW, ...)
                  supervisor return instructions (RFDE, RFE, RFI, RFWO, RFWU)
                  Conditional branch instructions where execution
                  transfers to the target (aka. taken branch),
                   or loopgtz/loopnez instr where execution skips
                   the loop (aka. not-taken loop)
                  J instr
                  CALLn instr
                  Conditional branch instr where execution
                   falls through (aka. not-taken branch)
                  Loop instr where execution falls into loop (aka. taken loop)
                  Last inst of loop and execution transfers
                   to LBEG (aka. loopback taken)
                  Last inst of loop and execution falls
                   through to LEND (aka. loopback fallthrough)
                  Non-branch instr (aka. non-CTI)
Value =     50380, select = 10, mask = 0004.  Load Instruction (Data Memory).
                  Load from local memory i.e. DataRAM, DataROM, InstRAM, InstROM
Value =     42090, select = 11, mask = 0004.  Store Instruction (Data Memory).
                  Store to local memory i.e. DataRAM, InstRAM
Value =    814010, select =  6, mask = 01ed.  Hold and Other Bubble cycles.
                  Processor domain PSO bubble
                  R hold caused by Data Cache miss(unused)
                  R hold caused by Store release
                  R hold caused by MEMW, EXTW or EXCW
                  R hold caused by Halt instruction (TX only)
                  CTI bubble (e.g. branch delay slot)
                  WAITI bubble i.e. a cycle spent in WaitI power down mode.
Value =     91127, select =  6, mask = 0010.  Hold and Other Bubble cycles.
                  R hold caused by register dependency
Value =         0, select =  1, mask = 0001.  Overflow of counter.
                  Overflow counter
Value =   1510296, select =  0, mask = 6c696857.  Counts cycles.
                  Amount of cycles
Heap summary for capabilities 0x00000400:
  At 0x3f800000 len 4194303 free 4192063 allocated 52 min_free 4096387
    largest_free_block 4128768 alloc_blocks 3 free_blocks 2 total_blocks 5
  Totals:
    free 4192063 allocated 52 min_free 4096387 largest_free_block 4128768
Heap summary for capabilities 0x00000802:
  At 0x3ffbb744 len 32767 free 20811 allocated 11200 min_free 9615
    largest_free_block 10752 alloc_blocks 1 free_blocks 2 total_blocks 3
  At 0x3ffae6e0 len 6432 free 300 allocated 5696 min_free 152
    largest_free_block 0 alloc_blocks 74 free_blocks 0 total_blocks 74
  At 0x3ffb67b8 len 170056 free 9552 allocated 159544 min_free 1360
    largest_free_block 8192 alloc_blocks 288 free_blocks 2 total_blocks 290
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 3444
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x40091818 len 59368 free 58544 allocated 0 min_free 42160
    largest_free_block 57344 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 216791 allocated 176440 min_free 169679 largest_free_block 110592
Heap summary for capabilities 0x00000804:
  At 0x3ffbb744 len 32767 free 20811 allocated 11200 min_free 9615
    largest_free_block 10752 alloc_blocks 1 free_blocks 2 total_blocks 3
  At 0x3ffae6e0 len 6432 free 300 allocated 5696 min_free 152
    largest_free_block 0 alloc_blocks 74 free_blocks 0 total_blocks 74
  At 0x3ffb67b8 len 170056 free 9552 allocated 159544 min_free 1360
    largest_free_block 8192 alloc_blocks 288 free_blocks 2 total_blocks 290
  At 0x3ffe0440 len 15072 free 14636 allocated 0 min_free 3444
    largest_free_block 14336 alloc_blocks 0 free_blocks 1 total_blocks 1
  At 0x3ffe4350 len 113840 free 112948 allocated 0 min_free 112948
    largest_free_block 110592 alloc_blocks 0 free_blocks 1 total_blocks 1
  Totals:
    free 158247 allocated 176440 min_free 127519 largest_free_block 110592

In short, it's slower (as expected), but it consumes more memory (unexpected) than the standard std::vector class. Maybe I'm missing something obvious here, as it shouldn't behave like this. There are a few places in the lottie code that use std::copy_n(... std::back_inserter) without reserving the memory area beforehand, and in these places it calls std::vector::push_back repeatedly, which in turn reserves an exponentially growing buffer. My implementation only grows to the minimum possible size, so it should save memory, but it's clearly not acting this way. Maybe I have some memory leaks here.

It's in my memShrinkVector branch if you want to test.
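
For reference, the cycle counts in the log above can be turned into ratios with a few lines of code. The constants are copied from the first "Counts cycles" line and the "Hold and Other Bubble cycles" (mask 01ed) line of each run. The names are hypothetical. With these numbers, SPIRAM comes out roughly 3.3x slower than the "dram" run and roughly 3.7x slower than the "iram" run.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Total cycles per run, copied from the perfmon output above.
static const uint64_t kCyclesPsram = 4399859;
static const uint64_t kCyclesIram  = 1203417;
static const uint64_t kCyclesDram  = 1319464;

// Hold/bubble cycles (mask 01ed) per run, copied from the same output.
static const uint64_t kBubblesPsram = 2634778;
static const uint64_t kBubblesIram  = 710148;
static const uint64_t kBubblesDram  = 814010;

// Plain floating-point ratio of two counters.
static double ratio(uint64_t a, uint64_t b)
{
    assert(b != 0);
    return (double)a / (double)b;
}

// Prints the slowdown of SPIRAM relative to internal RAM and the share of
// cycles each run spent in hold/bubble states.
static void printSummary()
{
    printf("psram/dram cycles: %.2f\n", ratio(kCyclesPsram, kCyclesDram));
    printf("psram/iram cycles: %.2f\n", ratio(kCyclesPsram, kCyclesIram));
    printf("bubble share psram: %.1f%%\n", 100.0 * ratio(kBubblesPsram, kCyclesPsram));
    printf("bubble share iram:  %.1f%%\n", 100.0 * ratio(kBubblesIram, kCyclesIram));
    printf("bubble share dram:  %.1f%%\n", 100.0 * ratio(kBubblesDram, kCyclesDram));
}
```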

@X-Ryl669

Ok, I've made some progress here on understanding why it takes more memory. The issue is not the consumed heap memory (which is lower by 4kB) but the overhead of the many small allocations in the heap itself. Here's the output of valgrind with massif on my vector class:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 52  1,957,788,763       12,278,728       12,175,852       102,876            0
 53  1,982,199,067       12,277,960       12,175,076       102,884            0
 54  2,005,128,633       12,277,384       12,174,404       102,980            0
 55  2,028,067,173       12,359,848       12,256,836       103,012            0
 56  2,060,992,276       12,277,672       12,174,636       103,036            0
 57  2,082,981,179       12,268,312       12,165,404       102,908            0
 58  2,109,541,680       12,276,424       12,173,500       102,924            0
 59  2,154,041,323       12,276,472       12,173,516       102,956            0
 60  2,177,594,690       12,277,120       12,174,000       103,120            0
 61  2,222,086,994       12,277,120       12,174,000       103,120            0

vs the output with std::vector:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 53  2,459,686,925       12,296,816       12,194,035       102,781            0
 54  2,510,345,448       12,302,464       12,199,747       102,717            0
 55  2,527,458,246       12,297,680       12,194,915       102,765            0
 56  2,548,331,031       12,298,512       12,195,755       102,757            0
 57  2,569,239,055       12,364,152       12,261,387       102,765            0
 58  2,590,159,207       12,298,800       12,196,035       102,765            0
 59  2,611,142,452       12,364,472       12,261,715       102,757            0
 60  2,632,139,929       12,364,616       12,261,851       102,765            0
 61  2,653,156,140       12,299,200       12,196,451       102,749            0
 62  2,674,030,260       12,299,200       12,196,451       102,749            0

We see that the useful heap is smaller for my vector (12,174kB vs 12,196kB) but the overhead is higher (103,120 vs 102,749 bytes). So I guess TLSF on the ESP32 is even worse (I've no tool to measure) than my computer's heap algorithm.

So in the end, it doesn't make sense to allocate with my vector here. I'm reverting the commit on the memShrink branch.

@X-Ryl669

I re-ran the test after adding Vvector::reserve where appropriate, and I'm getting 165280 bytes allocated in DRAM vs 165636 with std::vector. Not sure it's worth it for 356 bytes saved in the heap.
