Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to repro memory out of bounds errors in CI #1199

Closed
wants to merge 7 commits into from

Conversation

brandonpayton
Copy link
Member

This is a test PR to hopefully reproduce memory access out of bounds errors with #1189.

So far, @adamziel has only encountered errors in CI (see #1194), but there was one error report in WP.org Slack:
https://wordpress.slack.com/archives/C06Q5DCKZ3L/p1712232849960669

@brandonpayton brandonpayton added [Type] Bug An existing feature does not function as intended [Priority] High [Feature] PHP.wasm labels Apr 4, 2024
@brandonpayton brandonpayton self-assigned this Apr 4, 2024
@brandonpayton
Copy link
Member Author

Memory access out of bounds error:
https://github.com/WordPress/wordpress-playground/actions/runs/8562097173/job/23464704851#step:4:200

stderr | src/lib/steps/site-data.spec.ts > Blueprint step setSiteOptions() > should set the site option
ErrorEvent { isTrusted: [Getter] }
RuntimeError: memory access out of bounds
    at ZEND_SEND_ARRAY_SPEC_HANDLER (wasm://wasm/033df1be:1:6710306)
    at execute_ex (wasm://wasm/033df1be:1:6442747)
    at zend_execute (wasm://wasm/033df1be:1:2218529)
    at dynCall_vii (wasm://wasm/033df1be:1:6481715)
    at ret.<computed> (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/node/public/php_8_0.js:6682:24)
    at runtime.asm.<computed> (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/universal/src/lib/wasm-error-reporting.ts:45:13)
    at invoke_vii (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/node/public/php_8_0.js:7572:3)
    at zend_eval_stringl (wasm://wasm/033df1be:1:2219687)
    at zend_eval_string (wasm://wasm/033df1be:1:5391373)
    at dynCall_iiii (wasm://wasm/033df1be:1:6483396) {
  cause: Error: 
      at Object.Asyncify.handleSleep (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/node/public/php_8_0.js:7771:49)
      at _wasm_poll_socket (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/node/public/php_8_0.js:6543:18)
      at php_pollfd_for (wasm://wasm/033df1be:1:597125)
      at php_network_connect_socket (wasm://wasm/033df1be:1:3745501)
      at php_tcp_sockop_set_option (wasm://wasm/033df1be:1:5799873)
      at php_openssl_sockop_set_option (wasm://wasm/033df1be:1:5843983)
      at _php_stream_set_option (wasm://wasm/033df1be:1:227674)
      at dynCall_iiiii (wasm://wasm/033df1be:1:6483175)
      at ret.<computed> (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/node/public/php_8_0.js:6682:24)
      at runtime.asm.<computed> (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/universal/src/lib/wasm-error-reporting.ts:45:13)
}

@adamziel @bgrgicak @dmsnell

It's very possible this is caused by the recent memory leak workaround, but is it possible this a real out of bounds memory access that was previously masked by the fact that adjacent memory was not truly freed?

To test this, I'm going to try commenting out the "free()" call in wasm_memory_storage, and see what happens. If no tests run out of memory (except the OOM test I just added) and the failing test is no longer triggering an out of bounds error, that may be a clue.

@brandonpayton brandonpayton force-pushed the try-memory-out-of-bounds-repro branch from 3ea3a86 to dab7060 Compare April 5, 2024 00:07
@brandonpayton
Copy link
Member Author

brandonpayton commented Apr 5, 2024

Interesting. When I commented out the free() call in wasm_memory_storage.c, the memory access out of bounds test failure goes away.

With free() in wasm_memory_storage.c, the test failure is due to access out of bounds. From an earlier CI run:

⎯⎯⎯⎯⎯⎯⎯ Failed Tests 1 ⎯⎯⎯⎯⎯⎯⎯
 FAIL  src/lib/steps/site-data.spec.ts > Blueprint step setSiteOptions() > should set the site option
Error: memory access out of bounds
 ❯ _NodePHP.#handleRequest ../../php-wasm/universal/src/lib/base-php.ts:657:21
    655|     'betterMessage' in err ? err.betterMessage : err.message
    656|    ) as string;
    657|    const rethrown = new Error(message);
       |                     ^
    658|    rethrown.cause = err;
    659|    console.error(rethrown);
 ❯ _NodePHP.run ../../php-wasm/universal/src/lib/base-php.ts:278:21
 ❯ src/lib/steps/site-data.spec.ts:29:20

Without free() is wasm_memory_storage.c, the only test failure is the Out of Memory test we just added. From the most recent CI run:

⎯⎯⎯⎯⎯⎯⎯ Failed Tests 1 ⎯⎯⎯⎯⎯⎯⎯
 FAIL  src/test/php-memory.spec.ts > PHP 8.0 – memory allocation > can concat large string out of many small strings without reaching Out-of-memory condition
Error: PHP.run() failed with exit code 255 and the following output: PHP Fatal error:  Out of memory (allocated 19988480) (tried to allocate 15794200 bytes) in php-wasm run script on line 5

Unless the access out of bounds error has something to do with wasm_memory_storage's free() call itself, maybe there is a real access out of bounds condition. What was previously accessing leaked memory could now be attempting to access freed memory.

I'm trying to keep my mind open, but why else would stopping free() silence the access out of bounds error?

At least two possibilities:

  • Real access out of bounds, attempting to access freed memory
  • Perhaps an incorrect free() call could trigger an access out of bounds error (but it seems like that would trigger another kind of error)

@brandonpayton
Copy link
Member Author

Note: I added a commit to hopefully log OS and node version but found afterward that the info is already linked to in CI job "Set up job" section.
https://github.com/actions/runner-images/blob/ubuntu22/20240324.2/images/ubuntu/Ubuntu2204-Readme.md

This includes the OS version and various software versions, including Node.js.

@brandonpayton
Copy link
Member Author

At least two possibilities:

  • Real access out of bounds, attempting to access freed memory
  • Perhaps an incorrect free() call could trigger an access out of bounds error (but it seems like that would trigger another kind of error)

In my testing with whatever version of Emscripten is installed with homebrew (3.1.56-git), it turns out that both assigning out of bounds and free()ing out of bounds trigger RuntimeError: memory access out of bounds.

I alternately commented out the out of range assignment and free() to test in the example below.

#include <stdio.h>
#include <malloc.h>

int main() {

	char* ptr = malloc(8 * 1024 * 1024);
	printf("Ptr: %p\n", ptr);

	char* ptr_oob = ptr + (17 * 1024 * 1024);
	printf("Oob: %p\n", ptr_oob);
	*ptr_oob = 'A';
	//free(ptr_oob);

	return 0;	
}

I am trying to reproduce this by running the CI workflow locally using act. If we can do that, then it should be easier to debug and find what the issue is at runtime. If not, perhaps we can take the PHP .wasm and .js files and follow the names and addresses from error's call stack to find the culprit.

@adamziel
Copy link
Collaborator

adamziel commented Apr 5, 2024

One thing that keeps bugging me is what's up with these socket connections in the stack trace? The crash stems from the get_option() call, why would it start a network call?

I wonder if php.run() starts a network connection that interrupts the execution and triggers a new execution within the same PHP instance. It would be very weird because I don't think that test runs a server, but I don't have a better explanation.

at Object.Asyncify.handleSleep (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/node/public/php_8_0.js:7771:49)
          at _wasm_poll_socket (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/node/public/php_8_0.js:6543:18)
          at php_pollfd_for (wasm://wasm/033e112a:1:597125)
          at php_network_connect_socket (wasm://wasm/033e112a:1:3745501)
          at php_tcp_sockop_set_option (wasm://wasm/033e112a:1:5[799](https://github.com/WordPress/wordpress-playground/actions/runs/8575488635/job/23504614176#step:6:800)987)
          at php_openssl_sockop_set_option (wasm://wasm/033e112a:1:5844097)
          at _php_stream_set_option (wasm://wasm/033e112a:1:227674)
          at dynCall_iiiii (wasm://wasm/033e112a:1:6483276)
          at ret.<computed> (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/node/public/php_8_0.js:6682:24)
          at runtime.asm.<computed> (/home/runner/work/wordpress-playground/wordpress-playground/packages/php-wasm/universal/src/lib/wasm-error-reporting.ts:45:13)

@brandonpayton
Copy link
Member Author

@adamziel, that's helpful! 😄 The socket connections seemed strange, but since Playground is new to me, I wasn't certain they were indeed strange. That's something more to think about.

Will try to log each call to original functions from instrumentWasmExports(). Perhaps that will give us a hint about what is triggering this.

@brandonpayton brandonpayton force-pushed the try-memory-out-of-bounds-repro branch from 5a147a8 to cf0dc4e Compare April 6, 2024 21:31
@brandonpayton brandonpayton requested a review from a team as a code owner April 6, 2024 23:05
@brandonpayton
Copy link
Member Author

brandonpayton commented Apr 7, 2024

brandonpayton requested a review from https://github.com/orgs/WordPress/teams/playground-maintainers as a code owner

^ Hmm... not sure how this review request happened. This is just a PR for trying whatever we can to obtain more info about how we're getting a "memory access out of bounds" error in CI unit tests.

I tried:

  • Heavily instrumenting PHP 8.0 around the apparent source of the error
  • Adding names to the anonymous functions in the error call stack to help us better identify the original call leading to the error.
  • Experimenting with Emscripted ASSERTION=1, SAFE_HEAP=1, SAFE_HEAP_LOG=1 options to see if they revealed additional info or errored earlier in the process
  • Upgrading vitest to see if disabling fileParallelism had any affect (which would be a clue) ... I'm not sure if I tested this one properly as we have multiple test suites that may be running at the same time. The setting I tweaked may have just affected execution of a single suite, but what we really want to test is whether running a single test at a time has any affect on the error.

So far, I haven't learned much more than we already knew.

The most interesting thing I noticed from the instrumentation was that there is an apparently anonymous PHP function call not too long before the out of bounds access error.
https://gist.github.com/brandonpayton/27af9332eb587f328c2b8716cbe621ea#file-function-call-and-return-logs-L37

I'm going to try upgrading building with a more up-to-date Emscripten version to see if that has any effect. I doubt it will, but it is worth a try.

After that, I plan to let this lie for a couple days of AFK and then resume as able on Tuesday.

@brandonpayton brandonpayton force-pushed the try-memory-out-of-bounds-repro branch from 2b4f00b to 350f625 Compare April 7, 2024 04:26
@brandonpayton
Copy link
Member Author

A couple more things to explore:

  • Instrument more of the socket- and fd-related functions to see if we can learn what might be going on
  • Try patching PHP memory allocation more directly than using a PHP extension. Perhaps we are running into this problem because there is some period of time before wasm_memory_storage is loaded when PHP allocates memory differently.

@adamziel
Copy link
Collaborator

adamziel commented Apr 8, 2024

The network call is:

POST /wp-cron.php?doing_wp_cron=1712572234.0380001068115234375000 HTTP/1.1
Host: 127.0.0.1:8000
User-Agent: WordPress/6.4.3; http://127.0.0.1:8000/
Accept-Encoding: deflate;q=1.0, compress;q=0.5, gzip;q=0.5
Content-Length: 0
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Connection: Close

The error is gone when DISABLE_WP_CRON is set to true.

@adamziel
Copy link
Collaborator

adamziel commented Apr 8, 2024

Aha, it's not about the DISABLE_WP_CRON constant. It's about having the defineWpConfigConsts step in place. The code below crashes, but if you uncomment the defineWpConfigConsts call then the test actually passes. Could that be a memory alignment issue? The crash seems to be happening in just the right conditions, and prior allocations seem to affect those.

	it('should set the site option', async () => {
		// await defineWpConfigConsts(php, {
		// 	consts: {
		// 		TEST: 1,
		// 	},
		// });
		await setSiteOptions(php, {
			options: {
				blogname: 'My test site!',
			},
		});
		console.log('after setSiteOptions');
		const response = await php.run({
			code: `<?php
                require '/wordpress/wp-load.php';
                echo get_option('blogname');
			`,
		});
		expect(response.text).toBe('My test site!');
	});

@adamziel adamziel force-pushed the try-memory-out-of-bounds-repro branch from da3afdd to 350f625 Compare April 8, 2024 11:14
@adamziel
Copy link
Collaborator

adamziel commented Apr 8, 2024

CI in this PR now seems to be failing with these errors:

"PHP Fatal error:  Uncaught PDOException: SQLSTATE[HY000]: General error: 10 disk I/O error in /wordpress/wp-content/mu-plugins/sqlite-database-integration/wp-includes/sqlite/class-wp-sqlite-translator.php:400

I'm not sure what that's all about – it works in #1212 and, other than the newer emscripten version, I don't see any significant changes between 5a6343a and 350f625. I wonder if the version upgrade changed anything significant in the FS handling – or perhaps one of our "regex patches" no longer applies cleanly?

@adamziel
Copy link
Collaborator

adamziel commented Apr 8, 2024

@brandonpayton a few questions that came to mind:

  • Is it crashing on allocation or on free()?
  • Could it be wasm_memory_storage_chunk_alloc() returning NULL when a pointer is expected?
  • Which pointer is it trying to allocate/free? Was it already freed before by a different allocator? Is any data structure stored on the HEAP at that address?
  • Is there any chance PHP still tries to rely on partial unmap semantics?

You may call JavaScript code from C, too, to inspect the HEAP directly or print stack traces.

@brandonpayton
Copy link
Member Author

brandonpayton commented Apr 8, 2024

CI in this PR now seems to be failing with these errors:

"PHP Fatal error:  Uncaught PDOException: SQLSTATE[HY000]: General error: 10 disk I/O error in /wordpress/wp-content/mu-plugins/sqlite-database-integration/wp-includes/sqlite/class-wp-sqlite-translator.php:400

@adamziel Sorry, I was trying a bunch of different things on Saturday and could have left this in a better state for you and others to look at.

This is just an error seen after updating Emscripten, and it is probably a distraction (though very interesting for later :). The point of updating Emscripten was just a thing to try at the end of the day to see if it affected anything. Though if it did avoid the error, we couldn't be sure it wasn't just hiding the problem for the moment.

If I have time later today, I'll try to restore where this branch was with instrumented JS and PHP, so you can more easily see the CI output for that.

I'm not sure what that's all about – it works in #1212 and, other than the newer emscripten version, I don't see any significant changes between 5a6343a and 350f625. I wonder if the version upgrade changed anything significant in the FS handling – or perhaps one of our "regex patches" no longer applies cleanly?

Yeah... that's something we can test sometime apart from this memory-related fix. It was just something to try to see if any behavior changed or if the memory handling caught this earlier or provided more info on the crash.

@brandonpayton a few questions that came to mind:

  • Is it crashing on allocation or on free()?

When wasm_memory_storage.c was instrumented to log the start and end of all chunk_alloc() and chunk_free() calls, I didn't see anything amiss. Here are the wasm_memory_storage logs from one CI run:
https://gist.github.com/brandonpayton/27af9332eb587f328c2b8716cbe621ea#file-alloc-and-free-by-wasm_memory_storage

One thing I saw in test is that tests passed after commenting out free() in wasm_memory_storage. This could mean that there is some other issue in Playground that was relying upon memory never being freed by munmap(). Or it might be something else. But based on the logs above, the crash hasn't appeared to occur within wasm_memory_storage.c.

After I restore this branch to using our current Emscripten version and the instrumented JS and PHP, I plan to test this again. Sometimes, CI tests don't catch the "memory access out of bounds" error, but most of the time they have. So I'd like to re-run the tests a bit with the free() removed to be more certain it's not a fluke.

  • Could it be wasm_memory_storage_chunk_alloc() returning NULL when a pointer is expected?

This is possible, but the logs above don't show that.

  • Which pointer is it trying to allocate/free? Was it already freed before by a different allocator? Is any data structure stored on the HEAP at that address?

The logs show the addresses, but I'd like to additionally have wasm_memory_storage_chunk_free() print the size passed to it so we get a more complete picture.

The only odd thing I see here is this call to free() where it is passed an address not seen earlier in the logs:
https://gist.github.com/brandonpayton/27af9332eb587f328c2b8716cbe621ea#file-alloc-and-free-by-wasm_memory_storage-L25-L26

  • Is there any chance PHP still tries to rely on partial unmap semantics?

I've wondered this too and just made a note to look and consider this further. Maybe there are other places we can instrument with log messages.

The hardest thing in nailing this down is that we've only been able to reproduce this with GitHub CI. I tried using act to run the CI workflow locally, but it passed with no errors. I've thought about setting up a self-hosted GitHub actions runner to see if local repro was possible there.

You may call JavaScript code from C, too, to inspect the HEAP directly or print stack traces.

Thank you for the link! I wasn't sure what was possible here.

One thing I really want is what address is out of bounds. That could help us know whether it is part of previously allocated space or just something else, like a totally bad address that is far out of bounds.

@brandonpayton
Copy link
Member Author

brandonpayton commented Apr 10, 2024

With trying multiple things within this PR, especially with some of the force-pushing that rewrote previous history and hid associated test runs, it turned out to be very hard for anyone to really follow all the things I've tried so far. And it was even hard for me to follow because links to previous CI runs were hidden after force push.

I think we should probably close this and use dedicated PRs for each of the approaches. We can leave multiple investigatory PRs open at the same time and just keep documenting the findings under the original issue #1128.

For example, the following would be separate PRs:

  • Instrumenting PHP and JS code to learn more about the crash
  • Checking whether upgrading Emscripten seems to work around the issue or adds better checks that help identify the problem
  • Enabling ASSERTIONS and SAFE_HEAP to see whether they provide more info
  • Using our own pre-built version of node that will print the actual out of bounds address

I am closing this and working on creating more focused PRs to help nail this problem down with less confusion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
[Feature] PHP.wasm [Priority] High [Type] Bug An existing feature does not function as intended
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants