Faster UTF8 strlen implementation (using DSP SIMD) #3479

bolknote · 2024-02-25T14:32:25Z

What's new

Faster strlen implementation (using DSP SIMD) (6 sec vs 39 sec)

Verification

    size_t len = 0;
    FuriHalRtcDateTime curr_dt;
    furi_hal_rtc_get_datetime(&curr_dt);
    uint32_t current_timestamp = furi_hal_rtc_datetime_to_timestamp(&curr_dt);

    FuriString* str1 = furi_string_alloc_set_str("ЯБ12341234");
    FuriString* str2 = furi_string_alloc_set_str("ЯБ1234123");
    FuriString* str3 = furi_string_alloc_set_str("ЯБ123412");
    FuriString* str4 = furi_string_alloc_set_str("ЯБ12341");

    for (int i = 0; i<1000000; i++)
    len += furi_string_utf8_length(str1) +
           furi_string_utf8_length(str2) +
           furi_string_utf8_length(str3) +
           furi_string_utf8_length(str4);

    furi_string_free(str1);
    furi_string_free(str2);
    furi_string_free(str3);
    furi_string_free(str4);

    FuriHalRtcDateTime curr_dt2;
    furi_hal_rtc_get_datetime(&curr_dt2);
    uint32_t current_timestamp2 = furi_hal_rtc_datetime_to_timestamp(&curr_dt2);

    FuriString* str = furi_string_alloc();
    furi_string_printf(str, "%d %lu", len, current_timestamp2 - current_timestamp);
    elements_multiline_text_aligned(
            canvas, 0, 0, AlignLeft, AlignTop, furi_string_get_cstr(str));

    furi_string_free(str);

Checklist (For Reviewer)

PR has description of feature/bug or link to Confluence/Jira task
Description contains actions to verify feature/bugfix
I've built this code, uploaded it to the device and verified feature/bugfix

bolknote · 2024-02-26T08:22:33Z

furi/core/string.c

+
+        __uadd8(0xFFFFFFFF, vec); result = __sel(zero, one);
+        if (result > 0) {
+            if (result == 0x01000000) {


This branch of code can be removed. It doesn't give a good performance boost, but the code gets more complicated.

CookiePLMonster · 2024-02-26T08:25:02Z

Looking at how the first loop always reads 4 bytes at a time - is it not at a risk of reading past the allocated memory bounds?

bolknote · 2024-02-26T08:29:16Z

Looking at how the first loop always reads 4 bytes at a time - is it not at a risk of reading past the allocated memory bounds?

Hi! As far as I know, there is no memory protection on the Flipper. Let knowledgeable people correct me.

CookiePLMonster · 2024-02-26T08:32:45Z

Hi! As far as I know, there is no memory protection on the Flipper. Let knowledgeable people correct me.

Cortex M4 has a memory protection unit and the stack area is protected, but those allocations are never on the stack - which means it'll be fine unless maybe the string lands at the very end of the mapped SRAM? Accessing memory within the 512MB SRAM bounds past the 256KB flipper has is likely going to result in a bus fault, but I have not verified that.

bolknote · 2024-02-26T08:37:06Z

Thanks for the answer!

Hi! As far as I know, there is no memory protection on the Flipper. Let knowledgeable people correct me.

Cortex M4 has a memory protection unit and the stack area is protected, but those allocations are never on the stack - which means it'll be fine unless maybe the string lands at the very end of the mapped SRAM?

When working on another commit (to support Unicode), I noticed that some firmware functions sometimes read a little more memory than allocated, so I thought it is safe.

Accessing memory within the 512MB SRAM bounds past the 256KB flipper has is likely going to result in a bus fault, but I have not verified that.

Could you tell me some way to check this?

CookiePLMonster · 2024-02-26T08:47:01Z

When working on another commit (to support Unicode), I noticed that some firmware functions sometimes read a little more memory than allocated, so I thought it was safe.

It might be, it's just something that raised a red flag for me without having actually verified this on the Flipper.

Could you tell me some way to check this?

Looking at the memory map of Cortex M3 (M4 has an identical map), you could just try reading past the bottom 256KB region of SRAM:

There is also a chance that the memory will just wrap around and you'll read aliases of the bottom 256KB region if you do that.

That said - take a look at string_size that is used by furi_string_size - it is clearly O(1):
https://github.com/P-p-H-d/mlib/blob/62c8ac3e5d4a7a4f8757328e7a80286fde2686b6/m-string.h#L123-L132
Therefore, maybe this entire problem is a non-issue and the optimized loop can just be bounded with

size_t max_len = string_size(str->string) >> 2;
while (max_len > 0)

? It's just a single extra comparison per loop iteration so it's unlikely to be a major hit, and it'll completely close this possible corner case.

bolknote · 2024-02-26T08:47:16Z

Accessing memory within the 512MB SRAM bounds past the 256KB flipper has is likely going to result in a bus fault, but I have not verified that.

In theory, I could check the address of the pointer, would that be enough?

CookiePLMonster · 2024-02-26T08:48:15Z

In theory, I could check the address of the pointer, would that be enough?

Probably, but if you're going to insert an additional check to the loop, IMO it'd be best to just add the length check as I described above.

bolknote · 2024-02-26T08:51:48Z

In theory, I could check the address of the pointer, would that be enough?

Probably, but if you're going to insert an additional check to the loop, IMO it'd be best to just add the length check as I described above.

Yes, thanks you! I will try to delve into this question and add a similar check.

CookiePLMonster · 2024-02-26T08:55:30Z

Yes, thanks you! I will try to delve into this question and add a similar check.

If I'm seeing it right, the bottom loop is already gathering the "remainder", so it might be enough to just change for(;;) into a length check, going 4 bytes at a time.

bolknote · 2024-02-26T08:57:10Z

Yes, thanks you! I will try to delve into this question and add a similar check.

If I'm seeing it right, the bottom loop is already gathering the "remainder", so it might be enough to just change for(;;) into a length check, going 4 bytes at a time.

The problem seems to be not in the lower loop (it will stop when it sees \0), but in the upper one. When reading a 4-byte vector, the instruction may try to read values outside the memory boundary.

This situation, of course, seems to me purely theoretical, since it will most likely mean that the device has run out of memory.

CookiePLMonster · 2024-02-26T08:59:43Z

The problem seems to be not in the lower loop (it will stop when it sees \0), but in the upper one. When reading a 4-byte vector, the instruction may try to read values outside the memory boundary.

My point is that the lower loop will already take care of the remainder of the string that is 3 characters or less, so the top loop can be changed from
for(;;)
to
for(const size_t max_len = string_size(str->string) >> 2; max_len > 0; max_len--)

bolknote · 2024-02-26T09:04:11Z

The problem seems to be not in the lower loop (it will stop when it sees \0), but in the upper one. When reading a 4-byte vector, the instruction may try to read values outside the memory boundary.

My point is that the lower loop will already take care of the remainder of the string that is 3 characters or less, so the top loop can be changed from for(;;) to for(const size_t max_len = string_size(str->string) >> 2; max_len > 0; max_len--)

Oh, I didn't understand at first :) I thought we were still talking about memory limitations. We should try to measure the performance of this version. Thanx!

CookiePLMonster · 2024-02-26T09:30:19Z

We should try to measure the performance of this version.

It will be interesting to know what metrics you get - but the price of a single additional if check per the loop iteration should be negligible, even if on your test case it's going to be noticeable due to you doing a million iterations.

DrZlo13 · 2024-02-26T10:17:18Z

Cortex M4 has a memory protection unit and the stack area is protected, but those allocations are never on the stack

The current thread stack is located on the heap, and there always will be a probability that it is located near the string, but this is not the problem, cos it is protected to "read-only mode". My concern is that you can hit core2-protected RAM.

CookiePLMonster · 2024-02-26T10:22:27Z

Cortex M4 has a memory protection unit and the stack area is protected, but those allocations are never on the stack

The current thread stack is located on the heap, and there always will be a probability that it is located near the string, but this is not the problem, cos it is protected to "read-only mode". My concern is that you can hit core2-protected RAM.

Ah true, I misunderstood the purpose of FuriHalMpuRegionStack, I thought it's a dedicated region at the bottom of the RAM map, just above the null pointer access protection region. This makes more sense though, because there's no guarantee all threads would have fit their stacks in that theoretical region, so arbitrary heap allocations make more sense.

bolknote · 2024-02-26T10:23:44Z

I think I need to rewrite this commit completely. I proceeded from the suggestion that we do not know the binary length of the string.

The rewritten version should be simpler.

bolknote · 2024-02-26T10:25:11Z

It will be interesting to know what metrics you get - but the price of a single additional if check per the loop iteration should be negligible, even if on your test case it's going to be noticeable due to you doing a million iterations.

I took a measurement, the speed did not change :)

CookiePLMonster · 2024-02-26T10:29:19Z

If you don't care about the validity of UTF-8 codepoints, then IIRC utf-8 strlen is as easy as (pseudocode)

len = 0;
for (char ch : str) {
   if ((ch & 0b11000000) != 0b10000000) len++;

This ends up counting all non-continuation bytes, but won't work correctly if the UTF-8 codepoints are malformed.

bolknote · 2024-02-26T10:30:42Z

If you don't care about the validity of UTF-8 codepoints, then IIRC utf-8 strlen is as easy as (pseudocode)
len = 0;
for (char ch : str) {
   if ((ch & 0b11000000) != 0b10000000) len++;
This ends up counting all non-continuation bytes, but won't work correctly if the UTF-8 codepoints are malformed.

It seems (signed char) c >= -64 contains less operation

DrZlo13 · 2024-02-26T10:36:59Z

Undraft when ready.

…e into fast-strlen

bolknote · 2024-02-28T17:22:17Z

Hi! I've done!

CookiePLMonster · 2024-02-28T17:23:53Z

I'm curious, what's the current test benchmark result? Is it still 6-ish seconds?

bolknote · 2024-02-28T17:26:02Z

I'm curious, what's the current test benchmark result? Is it still 6-ish seconds?

Oddly enough, yes. For some reason I thought it would be faster.

skotopes · 2024-03-04T08:33:22Z

So if it's not faster then why we need it?

CookiePLMonster · 2024-03-04T08:57:03Z

So if it's not faster then why we need it?

If I understood correctly, this implementation is not faster than the initial one that didn't check the binary length up-front and possibly accessed memory out of bounds. It is still much faster than the "naive" approach:

(6 sec vs 39 sec)

skotopes · 2024-03-04T09:13:18Z

I'm very doubt that we need it: we don't work with strings a lot, and when we do they are quite small.
Plus this code is write only: very cryptic, no test coverage. No one will be able(will want) to support it.

At the same time I do things that you've made quite interesting thing and you should try to contribute it to arm dsp lib.

CookiePLMonster · 2024-03-04T09:20:13Z

you should try to contribute it to arm dsp lib.

That's a fair point, does CMSIS offer string manipulation? CMSIS-DSP doesn't seem to: https://www.keil.com/pack/doc/CMSIS/DSP/html/index.html

skotopes · 2024-03-04T09:37:15Z

@CookiePLMonster hmm, that was closest thing I was able to think of. Not sure where else it can be useful.

CookiePLMonster · 2024-03-04T09:38:37Z

Does mlib itself have platform-specific optimizations? If so, maybe they'd be interested in that which also means Flipper would eventually get this change back, just in mlib form.

skotopes · 2024-03-04T09:44:27Z

Not really, but you may try to ask m-lib author, maybe he will be interested in it

bolknote · 2024-03-04T10:57:13Z

@skotopes

Hi!

I understand your doubts! I want to clarify two things for myself. If I wrote unit tests for this function and explained in detail how it works in the comment, would the commit be accepted?

I'm working in another commit on Unicode support in Flipper and the API functions I'm rewriting have operations to get Unicode characters by index. Since, in my experience, these operations are always slower than binary operations, I thought I'd try to speed them up.

So if there is hope that the commit can be accepted, I would like to extend the work on it.

skotopes · 2024-03-04T12:03:27Z

There are 2 groups for PRs like this one(low level, cryptic, blackmagic):

Essential. Brings visible, significant improvements. Those are going to be accepted.
Non essential. Speedups something that not used that often, no visible improvements. Those are not.

All optimizations of that kind requires unit tests with cross testing between optimized and un-optimized versions.

The best option is to ping us in GH issues or on discord before starting anything like that.

Faster strlen implementation (using DSP SIMD)

bc662d6

bolknote requested review from skotopes, DrZlo13, hedger and gsurkov as code owners February 25, 2024 14:32

bolknote commented Feb 26, 2024

View reviewed changes

DrZlo13 marked this pull request as draft February 26, 2024 10:36

Merge branch 'dev' into fast-strlen

2268eec

hedger added the Core+Services HAL, furi & core system services label Feb 26, 2024

Faster strlen implementation (DSP SIMD, improved)

fc4f7e8

bolknote added 2 commits February 28, 2024 20:18

Merge branch 'fast-strlen' of github.com:bolknote/flipperzero-firmwar…

745f049

…e into fast-strlen

Merge branch 'flipperdevices:dev' into fast-strlen

0464303

bolknote marked this pull request as ready for review February 28, 2024 17:23

skotopes closed this Mar 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster UTF8 strlen implementation (using DSP SIMD) #3479

Faster UTF8 strlen implementation (using DSP SIMD) #3479

bolknote commented Feb 25, 2024

bolknote Feb 26, 2024 •

edited

CookiePLMonster commented Feb 26, 2024 •

edited

bolknote commented Feb 26, 2024 •

edited

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024 •

edited

CookiePLMonster commented Feb 26, 2024 •

edited

bolknote commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024 •

edited

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024 •

edited

DrZlo13 commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024

bolknote commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024

DrZlo13 commented Feb 26, 2024

bolknote commented Feb 28, 2024

CookiePLMonster commented Feb 28, 2024

bolknote commented Feb 28, 2024

skotopes commented Mar 4, 2024

CookiePLMonster commented Mar 4, 2024

skotopes commented Mar 4, 2024

CookiePLMonster commented Mar 4, 2024

skotopes commented Mar 4, 2024

CookiePLMonster commented Mar 4, 2024

skotopes commented Mar 4, 2024

bolknote commented Mar 4, 2024

skotopes commented Mar 4, 2024

Faster UTF8 strlen implementation (using DSP SIMD) #3479

Faster UTF8 strlen implementation (using DSP SIMD) #3479

Conversation

bolknote commented Feb 25, 2024

What's new

Verification

Checklist (For Reviewer)

bolknote Feb 26, 2024 • edited

Choose a reason for hiding this comment

CookiePLMonster commented Feb 26, 2024 • edited

bolknote commented Feb 26, 2024 • edited

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024 • edited

CookiePLMonster commented Feb 26, 2024 • edited

bolknote commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024 • edited

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024 • edited

DrZlo13 commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024

bolknote commented Feb 26, 2024

CookiePLMonster commented Feb 26, 2024

bolknote commented Feb 26, 2024

DrZlo13 commented Feb 26, 2024

bolknote commented Feb 28, 2024

CookiePLMonster commented Feb 28, 2024

bolknote commented Feb 28, 2024

skotopes commented Mar 4, 2024

CookiePLMonster commented Mar 4, 2024

skotopes commented Mar 4, 2024

CookiePLMonster commented Mar 4, 2024

skotopes commented Mar 4, 2024

CookiePLMonster commented Mar 4, 2024

skotopes commented Mar 4, 2024

bolknote commented Mar 4, 2024

skotopes commented Mar 4, 2024

bolknote Feb 26, 2024 •

edited

CookiePLMonster commented Feb 26, 2024 •

edited

bolknote commented Feb 26, 2024 •

edited

bolknote commented Feb 26, 2024 •

edited

CookiePLMonster commented Feb 26, 2024 •

edited

bolknote commented Feb 26, 2024 •

edited

CookiePLMonster commented Feb 26, 2024 •

edited