Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster UTF8 strlen implementation (using DSP SIMD) #3479

Closed
wants to merge 5 commits into from

Conversation

bolknote
Copy link
Contributor

What's new

  • Faster strlen implementation (using DSP SIMD) (6 sec vs 39 sec)

Verification

    size_t len = 0;
    FuriHalRtcDateTime curr_dt;
    furi_hal_rtc_get_datetime(&curr_dt);
    uint32_t current_timestamp = furi_hal_rtc_datetime_to_timestamp(&curr_dt);

    FuriString* str1 = furi_string_alloc_set_str("ЯБ12341234");
    FuriString* str2 = furi_string_alloc_set_str("ЯБ1234123");
    FuriString* str3 = furi_string_alloc_set_str("ЯБ123412");
    FuriString* str4 = furi_string_alloc_set_str("ЯБ12341");

    for (int i = 0; i<1000000; i++)
    len += furi_string_utf8_length(str1) +
           furi_string_utf8_length(str2) +
           furi_string_utf8_length(str3) +
           furi_string_utf8_length(str4);

    furi_string_free(str1);
    furi_string_free(str2);
    furi_string_free(str3);
    furi_string_free(str4);

    FuriHalRtcDateTime curr_dt2;
    furi_hal_rtc_get_datetime(&curr_dt2);
    uint32_t current_timestamp2 = furi_hal_rtc_datetime_to_timestamp(&curr_dt2);

    FuriString* str = furi_string_alloc();
    furi_string_printf(str, "%d %lu", len, current_timestamp2 - current_timestamp);
    elements_multiline_text_aligned(
            canvas, 0, 0, AlignLeft, AlignTop, furi_string_get_cstr(str));

    furi_string_free(str);

Checklist (For Reviewer)

  • PR has description of feature/bug or link to Confluence/Jira task
  • Description contains actions to verify feature/bugfix
  • I've built this code, uploaded it to the device and verified feature/bugfix


__uadd8(0xFFFFFFFF, vec); result = __sel(zero, one);
if (result > 0) {
if (result == 0x01000000) {
Copy link
Contributor Author

@bolknote bolknote Feb 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This branch of code can be removed. It doesn't give a good performance boost, but the code gets more complicated.

@CookiePLMonster
Copy link
Contributor

CookiePLMonster commented Feb 26, 2024

Looking at how the first loop always reads 4 bytes at a time - is it not at a risk of reading past the allocated memory bounds?

@bolknote
Copy link
Contributor Author

bolknote commented Feb 26, 2024

Looking at how the first loop always reads 4 bytes at a time - is it not at a risk of reading past the allocated memory bounds?

Hi! As far as I know, there is no memory protection on the Flipper. Let knowledgeable people correct me.

@CookiePLMonster
Copy link
Contributor

Hi! As far as I know, there is no memory protection on the Flipper. Let knowledgeable people correct me.

Cortex M4 has a memory protection unit and the stack area is protected, but those allocations are never on the stack - which means it'll be fine unless maybe the string lands at the very end of the mapped SRAM? Accessing memory within the 512MB SRAM bounds past the 256KB flipper has is likely going to result in a bus fault, but I have not verified that.

@bolknote
Copy link
Contributor Author

bolknote commented Feb 26, 2024

Thanks for the answer!

Hi! As far as I know, there is no memory protection on the Flipper. Let knowledgeable people correct me.

Cortex M4 has a memory protection unit and the stack area is protected, but those allocations are never on the stack - which means it'll be fine unless maybe the string lands at the very end of the mapped SRAM?

When working on another commit (to support Unicode), I noticed that some firmware functions sometimes read a little more memory than allocated, so I thought it is safe.

Accessing memory within the 512MB SRAM bounds past the 256KB flipper has is likely going to result in a bus fault, but I have not verified that.

Could you tell me some way to check this?

@CookiePLMonster
Copy link
Contributor

CookiePLMonster commented Feb 26, 2024

When working on another commit (to support Unicode), I noticed that some firmware functions sometimes read a little more memory than allocated, so I thought it was safe.

It might be, it's just something that raised a red flag for me without having actually verified this on the Flipper.

Could you tell me some way to check this?

Looking at the memory map of Cortex M3 (M4 has an identical map), you could just try reading past the bottom 256KB region of SRAM:
image
There is also a chance that the memory will just wrap around and you'll read aliases of the bottom 256KB region if you do that.

That said - take a look at string_size that is used by furi_string_size - it is clearly O(1):
https://github.com/P-p-H-d/mlib/blob/62c8ac3e5d4a7a4f8757328e7a80286fde2686b6/m-string.h#L123-L132
Therefore, maybe this entire problem is a non-issue and the optimized loop can just be bounded with

size_t max_len = string_size(str->string) >> 2;
while (max_len > 0)

? It's just a single extra comparison per loop iteration so it's unlikely to be a major hit, and it'll completely close this possible corner case.

@bolknote
Copy link
Contributor Author

Accessing memory within the 512MB SRAM bounds past the 256KB flipper has is likely going to result in a bus fault, but I have not verified that.

In theory, I could check the address of the pointer, would that be enough?

@CookiePLMonster
Copy link
Contributor

In theory, I could check the address of the pointer, would that be enough?

Probably, but if you're going to insert an additional check to the loop, IMO it'd be best to just add the length check as I described above.

@bolknote
Copy link
Contributor Author

In theory, I could check the address of the pointer, would that be enough?

Probably, but if you're going to insert an additional check to the loop, IMO it'd be best to just add the length check as I described above.

Yes, thanks you! I will try to delve into this question and add a similar check.

@CookiePLMonster
Copy link
Contributor

Yes, thanks you! I will try to delve into this question and add a similar check.

If I'm seeing it right, the bottom loop is already gathering the "remainder", so it might be enough to just change for(;;) into a length check, going 4 bytes at a time.

@bolknote
Copy link
Contributor Author

bolknote commented Feb 26, 2024

Yes, thanks you! I will try to delve into this question and add a similar check.

If I'm seeing it right, the bottom loop is already gathering the "remainder", so it might be enough to just change for(;;) into a length check, going 4 bytes at a time.

The problem seems to be not in the lower loop (it will stop when it sees \0), but in the upper one. When reading a 4-byte vector, the instruction may try to read values outside the memory boundary.

This situation, of course, seems to me purely theoretical, since it will most likely mean that the device has run out of memory.

@CookiePLMonster
Copy link
Contributor

The problem seems to be not in the lower loop (it will stop when it sees \0), but in the upper one. When reading a 4-byte vector, the instruction may try to read values outside the memory boundary.

My point is that the lower loop will already take care of the remainder of the string that is 3 characters or less, so the top loop can be changed from
for(;;)
to
for(const size_t max_len = string_size(str->string) >> 2; max_len > 0; max_len--)

@bolknote
Copy link
Contributor Author

The problem seems to be not in the lower loop (it will stop when it sees \0), but in the upper one. When reading a 4-byte vector, the instruction may try to read values outside the memory boundary.

My point is that the lower loop will already take care of the remainder of the string that is 3 characters or less, so the top loop can be changed from for(;;) to for(const size_t max_len = string_size(str->string) >> 2; max_len > 0; max_len--)

Oh, I didn't understand at first :) I thought we were still talking about memory limitations. We should try to measure the performance of this version. Thanx!

@CookiePLMonster
Copy link
Contributor

CookiePLMonster commented Feb 26, 2024

We should try to measure the performance of this version.

It will be interesting to know what metrics you get - but the price of a single additional if check per the loop iteration should be negligible, even if on your test case it's going to be noticeable due to you doing a million iterations.

@DrZlo13
Copy link
Member

DrZlo13 commented Feb 26, 2024

Cortex M4 has a memory protection unit and the stack area is protected, but those allocations are never on the stack

The current thread stack is located on the heap, and there always will be a probability that it is located near the string, but this is not the problem, cos it is protected to "read-only mode". My concern is that you can hit core2-protected RAM.

@CookiePLMonster
Copy link
Contributor

Cortex M4 has a memory protection unit and the stack area is protected, but those allocations are never on the stack

The current thread stack is located on the heap, and there always will be a probability that it is located near the string, but this is not the problem, cos it is protected to "read-only mode". My concern is that you can hit core2-protected RAM.

Ah true, I misunderstood the purpose of FuriHalMpuRegionStack, I thought it's a dedicated region at the bottom of the RAM map, just above the null pointer access protection region. This makes more sense though, because there's no guarantee all threads would have fit their stacks in that theoretical region, so arbitrary heap allocations make more sense.

@bolknote
Copy link
Contributor Author

I think I need to rewrite this commit completely. I proceeded from the suggestion that we do not know the binary length of the string.

The rewritten version should be simpler.

@bolknote
Copy link
Contributor Author

It will be interesting to know what metrics you get - but the price of a single additional if check per the loop iteration should be negligible, even if on your test case it's going to be noticeable due to you doing a million iterations.

I took a measurement, the speed did not change :)

@CookiePLMonster
Copy link
Contributor

If you don't care about the validity of UTF-8 codepoints, then IIRC utf-8 strlen is as easy as (pseudocode)

len = 0;
for (char ch : str) {
   if ((ch & 0b11000000) != 0b10000000) len++;

This ends up counting all non-continuation bytes, but won't work correctly if the UTF-8 codepoints are malformed.

@bolknote
Copy link
Contributor Author

If you don't care about the validity of UTF-8 codepoints, then IIRC utf-8 strlen is as easy as (pseudocode)

len = 0;
for (char ch : str) {
   if ((ch & 0b11000000) != 0b10000000) len++;

This ends up counting all non-continuation bytes, but won't work correctly if the UTF-8 codepoints are malformed.

It seems (signed char) c >= -64 contains less operation

@DrZlo13 DrZlo13 marked this pull request as draft February 26, 2024 10:36
@DrZlo13
Copy link
Member

DrZlo13 commented Feb 26, 2024

Undraft when ready.

@hedger hedger added the Core+Services HAL, furi & core system services label Feb 26, 2024
@bolknote
Copy link
Contributor Author

Hi! I've done!

@bolknote bolknote marked this pull request as ready for review February 28, 2024 17:23
@CookiePLMonster
Copy link
Contributor

I'm curious, what's the current test benchmark result? Is it still 6-ish seconds?

@bolknote
Copy link
Contributor Author

I'm curious, what's the current test benchmark result? Is it still 6-ish seconds?

Oddly enough, yes. For some reason I thought it would be faster.

@skotopes
Copy link
Member

skotopes commented Mar 4, 2024

So if it's not faster then why we need it?

@CookiePLMonster
Copy link
Contributor

So if it's not faster then why we need it?

If I understood correctly, this implementation is not faster than the initial one that didn't check the binary length up-front and possibly accessed memory out of bounds. It is still much faster than the "naive" approach:

(6 sec vs 39 sec)

@skotopes
Copy link
Member

skotopes commented Mar 4, 2024

I'm very doubt that we need it: we don't work with strings a lot, and when we do they are quite small.
Plus this code is write only: very cryptic, no test coverage. No one will be able(will want) to support it.

At the same time I do things that you've made quite interesting thing and you should try to contribute it to arm dsp lib.

@skotopes skotopes closed this Mar 4, 2024
@CookiePLMonster
Copy link
Contributor

you should try to contribute it to arm dsp lib.

That's a fair point, does CMSIS offer string manipulation? CMSIS-DSP doesn't seem to: https://www.keil.com/pack/doc/CMSIS/DSP/html/index.html

@skotopes
Copy link
Member

skotopes commented Mar 4, 2024

@CookiePLMonster hmm, that was closest thing I was able to think of. Not sure where else it can be useful.

@CookiePLMonster
Copy link
Contributor

Does mlib itself have platform-specific optimizations? If so, maybe they'd be interested in that which also means Flipper would eventually get this change back, just in mlib form.

@skotopes
Copy link
Member

skotopes commented Mar 4, 2024

Not really, but you may try to ask m-lib author, maybe he will be interested in it

@bolknote
Copy link
Contributor Author

bolknote commented Mar 4, 2024

@skotopes

Hi!

I understand your doubts! I want to clarify two things for myself. If I wrote unit tests for this function and explained in detail how it works in the comment, would the commit be accepted?

I'm working in another commit on Unicode support in Flipper and the API functions I'm rewriting have operations to get Unicode characters by index. Since, in my experience, these operations are always slower than binary operations, I thought I'd try to speed them up.

So if there is hope that the commit can be accepted, I would like to extend the work on it.

@skotopes
Copy link
Member

skotopes commented Mar 4, 2024

There are 2 groups for PRs like this one(low level, cryptic, blackmagic):

  • Essential. Brings visible, significant improvements. Those are going to be accepted.
  • Non essential. Speedups something that not used that often, no visible improvements. Those are not.

All optimizations of that kind requires unit tests with cross testing between optimized and un-optimized versions.

The best option is to ping us in GH issues or on discord before starting anything like that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Core+Services HAL, furi & core system services
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants