
Add implementation of emscripten_memcpy_big based on bulk memory. #19128

Merged
merged 1 commit into main from bulk_memory on Apr 11, 2023

Conversation

sbc100
Collaborator

@sbc100 sbc100 commented Apr 3, 2023

These new functions live in `libbulkmemory`, which only gets included if bulk memory is enabled (either via `-mbulk-memory` directly or indirectly via `-pthread`).

benchmark results for benchmark.test_memcpy_1mb:

```
 v8:                       mean: 1.666
 v8-bulkmemory:            mean: 1.598
 v8-standalone-bulkmemory: mean: 1.576
 v8-standalone:            mean: 3.197
```

Here we can see that when bulk memory is enabled it is at least as fast as, if not faster than, the JS version.

v8-standalone doesn't have `emscripten_memcpy_big` at all and is much slower, as expected. By adding `-mbulk-memory` the standalone version becomes just as fast as the non-standalone one.
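For reference, the bulk-memory path can be sketched as below. This is an illustrative sketch only, not the actual `libbulkmemory` source; the function name and signature here are assumptions. The key point is that with `-mbulk-memory`, clang can lower `__builtin_memcpy` to a single wasm `memory.copy` instruction, so no call out to JS is needed for big copies.

```c
#include <stddef.h>

// Illustrative sketch (NOT the actual libbulkmemory source): with the
// bulk-memory feature enabled, the compiler lowers this builtin call
// to a single wasm `memory.copy` instruction, replacing the old
// JS-implemented emscripten_memcpy_big fallback.
void *emscripten_memcpy_big_sketch(void *restrict dest,
                                   const void *restrict src, size_t n) {
  return __builtin_memcpy(dest, src, n);
}
```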

@sbc100 sbc100 requested review from kripken and tlively April 3, 2023 23:31
@sbc100 sbc100 force-pushed the bulk_memory branch 2 times, most recently from b6ea0f2 to 3a26933 Compare April 3, 2023 23:35
@sbc100 sbc100 force-pushed the bulk_memory branch 5 times, most recently from 00f7e9b to 06d45a4 Compare April 4, 2023 00:45
@sbc100 sbc100 requested review from tlively and kripken April 4, 2023 01:09
@sbc100 sbc100 force-pushed the bulk_memory branch 2 times, most recently from b850366 to ddd7446 Compare April 4, 2023 05:05
@sbc100 sbc100 force-pushed the bulk_memory branch 2 times, most recently from bee195a to c8023e8 Compare April 10, 2023 16:02
@sbc100 sbc100 merged commit 6f3cfe3 into main Apr 11, 2023
2 checks passed
@sbc100 sbc100 deleted the bulk_memory branch April 11, 2023 04:51
@haberman

If bulk memory is available, why can't `memcpy()` and `memset()` use `memory.copy` and `memory.fill` exclusively? Why does `emscripten_memcpy_big()` need to exist at all in that case?

This code seems to be written with the assumption that the bulk memory operations have some kind of overhead that makes them only worthwhile for copies/fills of greater than 512 bytes. Is this true?
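The pattern being questioned looks roughly like the following. This is a simplified sketch, not the actual `emscripten_memcpy.c`; the function name, the 512-byte cutoff placement, and the byte-loop small path are illustrative assumptions based on the discussion.

```c
#include <stddef.h>

// Simplified sketch of the threshold pattern under discussion
// (hypothetical name, not the real emscripten_memcpy.c contents).
void *memcpy_threshold_sketch(void *dst, const void *src, size_t n) {
  if (n >= 512) {
    // Under -mbulk-memory this builtin call can lower to a single
    // wasm `memory.copy` instruction.
    return __builtin_memcpy(dst, src, n);
  }
  // Small copies: a plain byte loop, reflecting the assumption that
  // the bulk operation only pays off above the threshold.
  unsigned char *d = dst;
  const unsigned char *s = src;
  while (n--) {
    *d++ = *s++;
  }
  return dst;
}
```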

@sbc100
Collaborator Author

sbc100 commented Aug 27, 2023

> If bulk memory is available, why can't `memcpy()` and `memset()` use `memory.copy` and `memory.fill` exclusively? Why does `emscripten_memcpy_big()` need to exist at all in that case?
>
> This code seems to be written with the assumption that the bulk memory operations have some kind of overhead that makes them only worthwhile for copies/fills of greater than 512 bytes. Is this true?

I think you could well be correct, and we might just want to completely remove the traditional `__musl_memcpy` when bulk memory is available.

The code that this was replacing involved calling out to JavaScript, which certainly had/has a non-zero overhead. We would probably want to measure empirically whether `memory.copy` and `memory.fill` are faster or not for small copies, but it seems likely to me that they would have the same or better performance.

Do you have a particular reason to want this change? For example, is the cost of including `__musl_memcpy` concerning you?

@haberman

> Do you have a particular reason to want this change? For example, is the cost of including `__musl_memcpy` concerning you?

I am looking at a profile where `__memcpy()` (and `__musl_memset()` to a lesser extent) are significant costs, and I am hoping that using `memory.copy` and `memory.fill` would speed up my benchmark.

I don't know of a way to test this theory. Do you know of any way to force `memcpy()` to compile to `memory.copy`? I tried using `__builtin_memcpy()`, but oddly that still seems to call `__memcpy()`.

@sbc100
Collaborator Author

sbc100 commented Aug 28, 2023

The simplest way to do that would probably be to just remove the `if (n >= 512)` check from `emscripten_memcpy.c`, and then rebuild libc, either via `./embuilder build libc --force` or `./emcc --clear-cache` (to force all libraries to be rebuilt).
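The resulting experiment would look roughly like this: with the check gone, every call defers to the builtin regardless of size. This is a hedged sketch with hypothetical names, not the literal contents of the libc sources after the edit.

```c
#include <stddef.h>

// Sketch of the experiment suggested above (hypothetical names):
// with the `if (n >= 512)` check removed, every copy defers to the
// builtin, which lowers to wasm `memory.copy` when bulk memory is
// enabled.
void *memcpy_unconditional(void *dst, const void *src, size_t n) {
  return __builtin_memcpy(dst, src, n);
}

// Likewise for fills: this lowers to wasm `memory.fill`.
void *memset_unconditional(void *dst, int c, size_t n) {
  return __builtin_memset(dst, c, n);
}
```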

sbc100 added a commit that referenced this pull request Aug 28, 2023
Unlike the JS versions of these functions, there is no need to avoid
using these for small inputs.

Results of running the test_memcpy_128b test before and after this
change:

```
v8-bulk: mean: 1.536 (+-0.071) secs  median: 1.495  range: 1.471-1.650  (noise: 4.630%)  (5 runs)
        size:   149291, compressed:    54249
```

->

```
v8-bulk: mean: 1.489 (+-0.117) secs  median: 1.535  range: 1.268-1.606  (noise: 7.871%)  (5 runs)
        size:   148387, compressed:    53813
```

See comments in #19128