Explorations: Stream API #851

Closed
wants to merge 74 commits into from

Collaborator

@adamziel adamziel commented Dec 9, 2023

⚠️ Do not merge ⚠️ I use this large PR to explore, but I'd rather merge these changes as a series of smaller PRs that are easier to discuss and review:

Description

Implements a stream-based ZIP encoder and decoder using the CompressionStream and DecompressionStream classes.

Here's what we get:

  • Native ZIP support without having to rely on PHP's ZipArchive
  • Download and unzip WordPress plugins at the same time. Before this PR we had to download the entire bundle, pass it to PHP, run PHP code, and only then the file would be unzipped.
  • Partial download of large zip files.

To that last point:

ZIP as a remote, virtual filesystem

This change enables fast previewing of even 10GB-large zipped exports via partial downloads.

Imagine previewing a large site export with many photos and videos. The decodeRemoteZip function knows how to request just the list of files first, filter out the large ones, and then issue multiple fetch() requests to download the rest.

Effectively, we would only download ~5MB - 10MB of data for the initial preview, and then only download these larger assets once they're needed.
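To illustrate why listing the files is cheap: a ZIP archive ends with an End of Central Directory (EOCD) record that says where the file list lives, so a client can fetch the tail of the archive, locate that record, and then request only the central directory. The sketch below is an assumption about how such a lookup could work, not the `decodeRemoteZip` implementation. `findCentralDirectory` and `toRangeHeader` are hypothetical names, and Zip64 archives are not handled.

```javascript
// Hypothetical sketch; not the decodeRemoteZip implementation from this PR.
const EOCD_SIGNATURE = 0x06054b50;
// Fixed-size part of the EOCD record. Up to 65535 comment bytes may follow,
// so requesting "Range: bytes=-65557" is enough to capture the record.
const EOCD_MIN_SIZE = 22;

// Scans the tail of an archive for the EOCD record and returns the location
// of the central directory, or null when no record is found.
function findCentralDirectory(tail) {
	const view = new DataView(tail.buffer, tail.byteOffset, tail.byteLength);
	// Scan backwards because an archive comment may follow the record.
	for (let i = tail.byteLength - EOCD_MIN_SIZE; i >= 0; i--) {
		if (view.getUint32(i, true) === EOCD_SIGNATURE) {
			return {
				entryCount: view.getUint16(i + 10, true),
				size: view.getUint32(i + 12, true),
				offset: view.getUint32(i + 16, true),
			};
		}
	}
	return null;
}

// Builds the HTTP Range header value covering just the central directory.
function toRangeHeader({ offset, size }) {
	return `bytes=${offset}-${offset + size - 1}`;
}
```

The same idea applies to individual files: each central directory entry records the offset of its local file header, so only the byte ranges of the selected entries need to be requested.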

Technical details

Here's a few interesting functions shipped by this PR. Note the links point to a specific commit and may get outdated:

  • nextZipEntry() decodes a single zipped file
  • decodeRemoteZip() lists the files in a remote ZIP archive, filters them, and then downloads just the subset of bytes we need to get those files
  • encodeZip() turns a stream of File objects into a zip archive (as a stream of bytes)

Remaining work

There's a few more things to do here, but I still wanted to get some reviews in before spending the time on these just in case the API would substantially change:

  • Add unit tests.
  • Solve conflicts
  • Get the CI checks to pass.
  • Test in Safari where the support for streams seems to be limited somehow (MDN shows red X-es on a few pages, let's investigate this)
  • This PR updates Blueprints and the GitHub integration to demonstrate the impact on the codebase and give the reviewers more context. These changes will be offloaded to a series of follow-up PRs before this one is merged.
  • Merge!
  • Release new npm packages.
  • Refactor progress monitoring to use ReadableStream.tee() instead of the workaround we use now – once https://bugs.chromium.org/p/chromium/issues/detail?id=1512548 is fixed in chromium.
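As a rough illustration of the `ReadableStream.tee()` approach mentioned in the last item, the sketch below splits a byte stream into a data branch and a monitoring branch that only counts bytes. `monitorProgress` is a hypothetical name and this is not the workaround currently used in the codebase.

```javascript
// Hypothetical sketch of tee()-based progress monitoring.
function monitorProgress(stream, onProgress) {
	const [data, monitor] = stream.tee();
	let loaded = 0;
	// Drain the monitoring branch, reporting the running byte total.
	const done = monitor.pipeTo(
		new WritableStream({
			write(chunk) {
				loaded += chunk.byteLength;
				onProgress(loaded);
			},
		})
	);
	// `stream` carries the untouched data, `done` settles once the
	// monitoring branch has seen every chunk.
	return { stream: data, done };
}
```

Both branches have to be consumed. With tee(), chunks are buffered for whichever branch reads more slowly, which is worth keeping in mind for very large downloads.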

API changes

Breaking changes

This PR isn't a breaking change yet. One of the follow-up PRs will very likely propose some breaking changes, but this one only extends the available API.

Without this PR

Without this PR, unzipping a file requires writing it to Playground, calling PHP's unzip, and removing the temporary zip file:

const response = await fetch(remoteUrl);
// Download the entire byte array first
const bytes = new Uint8Array(await response.arrayBuffer());
// Copy those bytes into Playground memory
await writeFile(playground, {
	path: tmpZipPath,
	data: bytes,
});
// Run PHP code and use `ZipArchive` via unzip()
await unzip(playground, {
	zipPath: tmpZipPath,
	extractToPath: targetPath,
});
// Only now is the ZIP file extracted.
// We still need to clean up the temporary file:
await playground.unlink(tmpZipPath);

With this PR

With this PR, unzipping works like this:

const response = await fetch(remoteUrl);
// We can now unzip as we stream response bytes
decodeZip( response.body )
	// We also write the stream of unzipped files to PHP as new entries become available
	.pipeTo( streamWriteToPhp( playground, targetPath ) )

More examples

Here's what else the streaming API unlocks. Not all of these functions are shipped here, but they are quite easy to implement:

// In the browser, fetch a zip file:
(await fetch(url))
	.body
	.pipeThrough(decodeZip())
	.pipeTo(streamWriteToPhp(php, pluginsDirectory))

// In the browser, install from a VFS directory:
iteratorToStream(iteratePhpFiles(php, path))
	.pipeTo(streamWriteToPhp(php, pluginsDirectory))

// In the browser, install from a .zip inside VFS:
streamReadPhpFile(php, path)
	.pipeThrough(decodeZip())
	.pipeTo(streamWriteToPhp(php, pluginsDirectory))

// Funny way to do a recursive copy
iteratorToStream(iteratePhpFiles(php, fromPath))
	.pipeTo(streamWriteToPhp(php, toPath))

// Process a doubly zipped artifact from GitHub CI
(await fetch(artifactUrl))
	.body
	.pipeThrough(decodeZip())
	.pipeThrough(readBody())
	.pipeThrough(decodeZip())
	.pipeTo(streamWriteToPhp(php, pluginsDirectory))

// Export Playground files as zip
iteratorToStream(iteratePhpFiles(php, fromPath))
	.pipeThrough(encodeZip())
	.pipeThrough(concatBytes())
	.pipeTo(downloadFile('playground.zip'))

// Export Playground files to OPFS
iteratorToStream(iteratePhpFiles(php, fromPath))
	.pipeTo(streamWriteToOpfs('/playground'))

// Compute changeset to export to GitHub
changeset(
	iterateGithubFiles(org, repo, branch, path),
	iteratePhpFiles(php, fromPath)
);

// Read a subdirectory from a GitHub repo
decodeRemoteZip(
	zipballUrl,
	({ path }) => path.startsWith("themes/adventurer")
)
	.pipeThrough(enterDirectory('themes/adventurer'))
	.pipeTo(streamWriteToPhp(php, joinPath(themesPath, 'adventurer')))

// Write a single file from the zip into a path in PHP
decodeRemoteZip(
	artifactUrl,
	({ path }) => path.startsWith("path/to/README.md")
)
	.pipeTo(streamWriteToPhp(php, '/wordpress'))

// In node.js, install a plugin from a disk directory
iteratorToStream(iteratePhpFiles(php, path))
	.pipeTo(streamWriteToPhp(php, pluginsDir));
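Several of the examples above rely on `iteratorToStream`, whose implementation is not shown in this description. One plausible way to write such an adapter, offered as a sketch rather than the actual code, is to pull one item from the iterator each time the stream asks for more data:

```javascript
// Sketch of an iterator-to-stream adapter; the real helper may differ.
function iteratorToStream(iterable) {
	// Accept both async and sync iterables.
	const iterator =
		iterable[Symbol.asyncIterator]?.() ?? iterable[Symbol.iterator]();
	return new ReadableStream({
		// Called whenever the consumer is ready for another item, so the
		// iterator only advances as fast as the stream is read.
		async pull(controller) {
			const { done, value } = await iterator.next();
			if (done) {
				controller.close();
				return;
			}
			controller.enqueue(value);
		},
		// Give generators a chance to run their cleanup code.
		async cancel(reason) {
			await iterator.return?.(reason);
		},
	});
}
```

Because the iterator is advanced from `pull()`, backpressure from the destination (for example a slow write into PHP) naturally pauses the file iteration.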

Open questions

How can streams be expressed in Blueprints?

Solving this is out of scope for this PR, but it's one of the next problems to explore so we might as well use this PR as a space to contemplate. I'll move this into a separate issue once we solidify the shape of the API here and ship this PR.

URI notation

Perhaps the Blueprint steps could accept a protocol://URI notation? This is the option I'm leaning towards, unless a better alternative comes up.

Upsides:

  • Succinct
  • Somewhat familiar, e.g. git does git+https://git@...

Downsides:

  • It still looks a bit foreign and requires some getting used to
  • It cannot easily express nested resources like "A path inside a zip" or "A stream inside a zip"

{
	"steps": [
		{
			"step": "installPlugin",
			"pluginDir": "zipdir+plugin://hellodolly"
		},
		{
			"step": "installPlugin",
			"pluginDir": "zipdir+https://mysite.com/plugin.zip:hello-dolly/"
		}
	]
}
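To make the URI notation more concrete, here is one way such strings could be split into parts. The grammar is only an assumption inferred from the two examples above (`wrapper+scheme://source:pathInside`), and `parseResourceUri` is a hypothetical name, not an API proposed by this PR.

```javascript
// Hypothetical parser for the URI notation floated above; not part of this PR.
function parseResourceUri(uri) {
	const schemeEnd = uri.indexOf('://');
	if (schemeEnd === -1) {
		throw new TypeError(`Expected a protocol://URI, got: ${uri}`);
	}
	// An optional "wrapper+" prefix, e.g. "zipdir+", precedes the scheme.
	const plus = uri.lastIndexOf('+', schemeEnd);
	const wrapper = plus === -1 ? null : uri.slice(0, plus);
	const rest = uri.slice(plus + 1);
	// Treat the last ":" after the scheme separator as the start of the
	// path inside the archive. This is ambiguous for URLs with a port
	// number, which illustrates the "nested resources" downside.
	const pathStart = rest.indexOf('://') + 3;
	const colon = rest.lastIndexOf(':');
	const hasInnerPath = colon >= pathStart;
	return {
		wrapper,
		source: hasInnerPath ? rest.slice(0, colon) : rest,
		pathInside: hasInnerPath ? rest.slice(colon + 1) : null,
	};
}
```

For example, `parseResourceUri('zipdir+https://mysite.com/plugin.zip:hello-dolly/')` yields the wrapper `zipdir`, the source `https://mysite.com/plugin.zip`, and the inner path `hello-dolly/`.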

Object-based DSL

We can have unlimited flexibility with a custom DSL like the one below:

Upsides:

  • Flexibility

Downsides:

  • Even more complex than the URI proposal. Requires training, looks foreign, no one would be able to write it without consulting the documentation.

{
	"steps": [
		{
			"step": "installPlugin",
			"pluginFiles": {
				"resource": "unzip",
				"pathInside": "hello-dolly/",
				"zipFile": {
					"resource": "https",
					"url": "https://mysite.com/hello-dolly.zip"
				}
			}
		}
	]
}

Piping DSL

How about we mimic the JavaScript API using the JSON notation?

Upsides:

  • Flexibility
  • Less complex than the object-based DSL above
  • Looks similar to the JavaScript counterpart

Downsides:

  • We're essentially reimplementing JavaScript as a micro-language in JSON. I don't want people to express their code in JSON, I want them to have a convenient tool that can easily express complex concepts.

{
	"steps": [
		{
			"step": "installPlugin",
			"pluginFiles": { "pipe": [
				{ "fetch": "https://mysite.com/hello-dolly.zip" },
				"unzip",
				{ "enterDirectory": "hello-dolly/" }
			] }
		}
	]
}

cc @dmsnell

…them into PHP().

Also, explore using file iterators as a basic abstraction for passing
the file trees around.

Work in progress
Collaborator

@dmsnell dmsnell left a comment

Very nice to see the elimination of the intermediate unpacked folder. This should bring a nice improvement in the latency of installing ZIP files, leading to things finishing quite a bit faster during installs and boot.

entry['uncompressedSize'] = await stream.readUint32();
entry['fileNameLength'] = await stream.readUint16();
entry['extraLength'] = await stream.readUint16();
entry['fileName'] = new TextDecoder().decode(

it appears that Emscripten is the real culprit here, but filenames are not encoded. they are bytes separated by ASCII /. this will end up breaking filenames and inserting in places.
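The lossy behavior described in this comment is easy to reproduce: by default, TextDecoder replaces every byte sequence that is not valid UTF-8 with U+FFFD. One possible mitigation, shown here as a sketch and not as what the PR ended up doing, is to decode in `fatal` mode and fall back to the raw bytes when decoding fails. `decodeFileName` is a hypothetical helper.

```javascript
// Bytes for "f", 0xFC (not valid UTF-8 on its own), "r".
const rawName = new Uint8Array([0x66, 0xfc, 0x72]);

// The default decoder silently substitutes U+FFFD for the invalid byte,
// so the original filename can no longer be recovered from the string.
const lossyName = new TextDecoder().decode(rawName);

// Hypothetical helper: returns the decoded name when the bytes are valid
// UTF-8, or null so that the caller can keep working with the raw bytes.
function decodeFileName(bytes) {
	try {
		return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
	} catch {
		return null;
	}
}
```

Here `lossyName` is `"f\uFFFDr"`, while `decodeFileName(rawName)` returns `null` and leaves the decision about the raw bytes to the caller.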

});
}

const data = await stream.read(header.compressedSize);

curious: what happens if we attempt to read more bytes than are available?

Collaborator Author

@adamziel adamziel Dec 17, 2023


I removed the custom Stream class implementation. Right now we read n bytes using the native ReadableStreamBYOBReader class which sends a "read request" to the data source. Every data source is free to process it how it wants, but typically it will only fill the buffer up to Math.min(Buffer.size, numberOfAvailableDataBytes) bytes.
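Since a single BYOB read may return fewer bytes than requested, reading exactly n bytes takes a loop. The sketch below shows the general pattern under that assumption. `readExactly` is a hypothetical helper, not necessarily how this PR does it. At the end of the stream it returns the bytes it managed to read, which answers the question above: the result is simply shorter than requested.

```javascript
// Hypothetical helper: reads up to `length` bytes from a BYOB reader.
// Returns fewer bytes only when the stream ends early.
async function readExactly(reader, length) {
	let buffer = new ArrayBuffer(length);
	let offset = 0;
	while (offset < length) {
		// Ask the source to fill the unused part of the buffer.
		const { value, done } = await reader.read(
			new Uint8Array(buffer, offset, length - offset)
		);
		if (!value) {
			throw new Error('The stream was cancelled before the read finished');
		}
		// Each read transfers the buffer, so continue with the returned one.
		buffer = value.buffer;
		offset += value.byteLength;
		if (done) {
			break;
		}
	}
	return new Uint8Array(buffer, 0, offset);
}
```

The reader comes from `stream.getReader({ mode: 'byob' })`, which only works for byte streams, that is, streams created with `type: 'bytes'`.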

adamziel added a commit that referenced this pull request Dec 18, 2023
…nfig, and the sqlite-database-integration plugin (#872)

## Description

Includes the `wp-config.php` file and the default WordPress theme (like
`twentytwentythree`) in the zip file exported via the "Download as .zip"
button.

Without those files, the exported bundle isn't self-contained. It cannot
be hosted, and any importer needs to provide the missing files on
its own. The theme and the plugins are easy to backfill, but the data
stored in `wp-config.php` is lost.

## How is the problem addressed?

This PR adds a temporary private `selfContained` option to the
`importWpContent`. It is `false` by default to ensure those files are
not exported with GitHub PRs (the export flow relies on the same logic).
The zip download button sets it to `true`.

This is a temporary workaround as we currently don't have any better
tools to deal with this problem. Once the streaming/iterators API ships
in #851, we'll be
able to get rid of this hack and just filter the stream of files.

## Testing Instructions

Unfortunately, this PR ships no unit tests as without #895 there isn't
an easy way to test the `zipWpContent` function. Here's the manual
testing steps to take:

1. Open Playground
2. Make a change in the site content
3. Export Playground into a zip file
4. Confirm that zip file contains the `wp-config.php` file as well as
the `twentytwentyfour` theme and the `sqlite-database-integration`
plugin
5. Refresh the Playground tab and import that zip
6. Confirm it worked and the website is functional and has the content
update from before
7. Export it to GitHub, check the "include zip file" checkbox
8. Confirm the GitHub PR has no `twentytwentyfour` theme, or the
`wp-config.php` file, or the `sqlite-database-integration` plugin.
9. Do the same for the zip bundled with the GitHub PR
10. Import that PR and confirm it imports cleanly
@adamziel adamziel changed the title Stream-based zip encoder and decoder Exploration: Stream-based zip encoder and decoder Dec 18, 2023
@adamziel adamziel changed the title Exploration: Stream-based zip encoder and decoder Explorations: Stream API Dec 18, 2023
adamziel added a commit that referenced this pull request Dec 19, 2023
Adds a new `@php-wasm/node-polyfills` package to polyfill the features
missing in Node 18 and/or JSDOM environments. The goal is to make wp-now
and other Playground-based Node.js packages work in Node 18, which is
the current LTS release.

The polyfilled JavaScript features are:

* `CustomEvent` class
* `File` class
* `Blob.text()` and `Blob.arrayBuffer()` methods
* `File.arrayBuffer()` and `File.text()` methods
* `Blob.stream()` and `File.stream()` methods
* Ensures `File.stream().getReader({ mode: 'byob' })` is supported –
this is relevant for #851

I adapted the Blob methods from
https://github.com/bjornstar/blob-polyfill/blob/master/Blob.js as they
seemed to provide just the logic needed here and they also worked right
away.

This PR is a part of
#851 split out
into a separate PR to make it easier to review and reason about.

Supersedes #865

## Testing instructions

Confirm the unit tests pass. This PR ships a set of vite tests to
confirm the polyfills work both in vanilla Node.js and in jsdom runtime
environments.
adamziel added a commit that referenced this pull request Dec 22, 2023
Stream Compression introduced in #851 has no dependencies on WordPress and
can be used in any JavaScript project. It also makes sense as a dependency
for some `@php-wasm` packages. This commit, therefore, moves it from the
`wp-playground` to the `php-wasm` npm namespace, making it reusable across
the entire project.

In addition, this adds a new `iterateFiles` function to the `@php-wasm/universal`
package, which allows iterating over the files in the PHP filesystem. It
uses the `stream-compression` package, which was some of the motivation for
the move.

 ## Testing instructions

Since the package isn't used anywhere yet, only confirm if the CI checks
pass.
adamziel added a commit that referenced this pull request Dec 22, 2023
)

Stream Compression introduced in #851 has no dependencies on WordPress
and can be used in any JavaScript project. It also makes sense as a
dependency for some `@php-wasm` packages. This commit, therefore, moves
it from the `wp-playground` to the `php-wasm` npm namespace, making it
reusable across the entire project.

In addition, this adds a new `iterateFiles` function to the
`@php-wasm/universal` package, which allows iterating over the files in
the PHP filesystem. It uses the `stream-compression` package, which was
some of the motivation for the move.

This PR also ships eslint rules to keep the `stream-compression` package
independent from the heavy `@php-wasm/web` and `@php-wasm/node`
packages. This should enable using it in other projects with a minimal
dependency overhead of just `@php-wasm/util` and
`@php-wasm/node-polyfills`.

## Testing instructions

Since the package isn't used anywhere yet, only confirm if the CI checks
pass.
adamziel added a commit that referenced this pull request Dec 22, 2023
This small commit brings a part of #851 into trunk for easier review.

 ## Testing instructions

Confirm the CI tests pass
adamziel added a commit that referenced this pull request Jan 8, 2024
…898)

This small commit brings a part of #851 into trunk for easier review.

 ## Testing instructions

Confirm the CI tests pass
adamziel added a commit that referenced this pull request Jan 8, 2024
## What is this PR doing?

#851 explores
migrating file handling in Playground from buffering to JS-native
streams, and this PR brings over small, administrative changes from the
main branch to unclutter it. This is mostly about updating
`package.json` files, updating configs, and imports.

## Testing Instructions

Confirm the CI checks pass.
@adamziel
Collaborator Author

adamziel commented Jan 8, 2024

Safari doesn't support BYOB streams which is a blocker here. Perhaps there's a way to polyfill them? I experimented with that at 2b0acd0#diff-888d17d202646c67c560411a8e150946f30fec742ff592a7915d6c8370ab030e

@adamziel
Collaborator Author

Surfacing the performance concerns from #919 (comment).

Buffering the entire .zip file and passing it to PHP.wasm in one go takes around 300ms on my machine. However, using the browser streams and writing each file as it becomes available takes around 2s where:

  • ~50% is spent decoding the ZIP file, using ArrayBuffers, ReadableStreams, and such
  • ~50% is spent sending each decompressed file to the PHP worker via postMessage

There's potentially a lot of room for improvement, but I'm not too keen about tweaking this.

Tweaking the stream-based implementation would take a lot of time, and I'm not convinced about the benefits. The lower bound for the execution time seems to be set by the native version of libzip and libz. My intuition is that:

  • JavaScript–based zip decoder is essentially a lot of overhead on top of that lower bound including executing JavaScript code, copying the data in JS VM more than we need to, marshalling and sending the data via postMessage many times over etc.
  • WASM zip decoder is that lower bound times a factor like wasmSlowerThanNative (which could be nuanced, non-linear etc).

All of this is hand wavy and based on my intuition. I don't have any actual measurements and perhaps I could spend a dozen or more hours here and either prove or disprove those assumptions, but I think there's a more promising avenue to explore instead.

I wonder if we could stream all the downloaded bytes directly to WASM memory, and stream-handle them there. JavaScript would become an API layer over WASM operations, much like numpy in python. We wouldn't be passing data around, just orchestrating what happens on the lower level. The API could look like this:

php
    .acceptDataStream( ( await fetch( pluginFileURL ) ).body )
    .unzip()
    .writeTo( "/wordpress/wp-content/plugins/gutenberg" );

@adamziel
Collaborator Author

adamziel commented Apr 3, 2024

Blueprints v2 make integrating JavaScript ZIP streaming into Blueprint steps unnecessary as the data is processed using PHP streams instead of JavaScript ones. Let's still keep the stream processing package around, though, as it's independent and useful to have.

@adamziel adamziel closed this Apr 3, 2024