
Streaming not supported #6

Closed
lrosenthol opened this issue Nov 17, 2016 · 12 comments

Comments

@lrosenthol

One of the reasons that the packaging on the web spec chose to avoid ZIP in favor of something new & different was due to the (perceived) lack of streaming support.

However, this proposal suffers from the same problem. You cannot create it entirely in stream, due to (a) the way that offsets are used and (b) the index file needing to list all other files.

If streaming is not a requirement for this format - that's fine. But then, that should also be called out in the spec.

@addyosmani

I would also like to get some clarity around how packaging interfaces with streaming (even if it's a non-goal for the first version).

@dimich-g
Collaborator

Streaming is important and very much the goal, just as for the packaging on the web effort that you mention. I should have clarified that more (I will add to the Explainer, with corresponding example).

The idea is that there are two major use cases: streaming and 'local file'. The former is the regular way resources are used on the web - for example, one can package SVG markup together with a PNG image used in that markup, and refer to that package from a regular web page; or package a JS library. This usage normally happens over HTTP or HTTPS, and support for incremental streaming is important. The latter type of use case arises from local sharing, or from saving a package for offline use. In that case, the package is on a local device in its entirety, and it is potentially huge ("Wikipedia in a package", etc). There, it is important to be able to quickly access a resource in the [huge] package without unpacking it in any form, including things like seeking into a movie that is part of the package.

So the proposed format is trying to address both! Note the usage of MIME-like parts, boundaries, and per-part headers - that allows streaming use, by making it possible to parse the package while it trickles in. The offsets serve the local case, by allowing efficient IO operations on a locally-stored 'file'. Note that there is no information in the Content Index that would not be available from a part header, and the Content Index is optional.
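
To make the streaming side concrete, here is a minimal sketch (Python, with a made-up boundary token and simplified framing - not the actual spec syntax) of how a consumer could split parts out of the package while the bytes trickle in:

```python
import io

BOUNDARY = b"\r\n--package-boundary\r\n"  # assumed token, not from the spec

def iter_parts(stream, chunk_size=4096):
    """Yield each part's raw bytes as soon as its closing boundary is seen."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            if buf:
                yield buf  # simplification: the final part ends at EOF
            return
        buf += chunk
        while BOUNDARY in buf:  # a boundary straddling two reads stays in buf
            part, buf = buf.split(BOUNDARY, 1)
            if part:
                yield part  # this part is usable before the rest arrives

package = io.BytesIO(BOUNDARY.join([
    b"",
    b"Content-Location: /page.html\r\n\r\n<html>...",
    b"Content-Location: /img.png\r\n\r\n<png bytes>",
]))
for raw in iter_parts(package):
    headers, _, body = raw.partition(b"\r\n\r\n")
    print(headers.decode(), "->", len(body), "bytes")
```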

The signature, if the package is signed, would require the Content Index (with hashes for the parts). In that case, it makes sense for the Content Index and certificate to be at the beginning of the package to facilitate streaming, so the incoming parts can be validated as they become available; the tools that build such a package could ensure that ordering. This format doesn't depend on keeping a 'directory' at the end of the file as ZIP does (ZIP does that mostly for ease of appending possibly-duplicate files, which is not a goal for this format).
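
And a sketch of the signed case under the same assumptions: if a Content Index carrying per-part hashes (a made-up shape, for illustration only) arrives first, each part can be validated the moment its bytes are complete:

```python
import hashlib

# Assumed index shape: Content-Location -> expected SHA-256 hex digest.
content_index = {
    "/page.html": hashlib.sha256(b"<html>...</html>").hexdigest(),
    "/app.js": hashlib.sha256(b"console.log('hi');").hexdigest(),
}

def verify_part(location, body):
    """Validate one fully-received part against the up-front index."""
    expected = content_index.get(location)
    if expected is None:
        return False  # not listed in the (signed) index
    return hashlib.sha256(body).hexdigest() == expected

assert verify_part("/page.html", b"<html>...</html>")
assert not verify_part("/app.js", b"tampered body")
```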

@bmeck
Collaborator

bmeck commented Nov 29, 2016

> Note the usage of MIME-like parts and boundaries and per-parts headers - that allows streaming use by making it possible to parse the package while it trickles in.

MIME-like boundary strings must be guaranteed not to appear anywhere in the content bodies. This means either risking collisions or preprocessing every body to ensure the boundary string does not occur within it. From https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html:

> The encapsulation boundary MUST NOT appear inside any of the encapsulated parts. Thus, it is crucial that the composing agent be able to choose and specify the unique boundary that will separate the parts.

Things like chunked encoding do not suffer from the potential collision or preprocessing issue.
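
For instance, a length-prefixed framing in the spirit of chunked encoding (a sketch with an assumed 4-byte length prefix, not any particular wire format) never scans the body for a boundary, so arbitrary body bytes are safe:

```python
import io
import struct

def write_chunked(out, body, chunk_size=8):
    """Emit body as length-prefixed chunks; a zero-length chunk terminates."""
    for i in range(0, len(body), chunk_size):
        chunk = body[i:i + chunk_size]
        out.write(struct.pack(">I", len(chunk)))
        out.write(chunk)
    out.write(struct.pack(">I", 0))

def read_chunked(inp):
    body = b""
    while True:
        (n,) = struct.unpack(">I", inp.read(4))
        if n == 0:
            return body
        body += inp.read(n)

buf = io.BytesIO()
write_chunked(buf, b"the body may even contain --a-boundary-- safely")
buf.seek(0)
assert read_chunked(buf) == b"the body may even contain --a-boundary-- safely"
```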

> The signature, if the package is signed, would require the Content Index (with hashes for parts). In that case, it makes sense for Content Index and certificate to be in the beginning of the package to facilitate streaming - so the incoming parts can be validated as they are becoming available.

This can be done as a trailer per resource rather than as a batch up front. Computing the digest cannot finish until the entire resource is available, so I see no need to receive the signed digest prior to the content: the local digest to verify against cannot be created until the body has finished streaming.
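
A rough sketch of that per-resource trailer (assumed framing: an 8-byte length, the body, then a SHA-256 digest as the trailer; none of this is spec'd anywhere):

```python
import hashlib
import io

def write_resource(out, body):
    out.write(len(body).to_bytes(8, "big"))
    out.write(body)
    out.write(hashlib.sha256(body).digest())  # trailer: emitted after the body

def read_resource(inp):
    n = int.from_bytes(inp.read(8), "big")
    h = hashlib.sha256()
    body = b""
    while len(body) < n:  # hash incrementally as the bytes stream in
        chunk = inp.read(min(4096, n - len(body)))
        h.update(chunk)
        body += chunk
    if inp.read(32) != h.digest():
        raise ValueError("digest mismatch")
    return body

buf = io.BytesIO()
write_resource(buf, b"streamed content")
buf.seek(0)
assert read_resource(buf) == b"streamed content"
```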

> this format doesn't depend on the requirement to keep a 'directory' at the end of the file as ZIP does (which does it mostly for ease of append of possibly duplicating files, which is not a goal for this format)

There are other reasons to use ZIP; I would just like to register my disagreement that this is the prevailing reason ZIP uses trailing directories.

@lrosenthol
Author

> one can package an SVG image with an SVG markup and PNG image that is used in that markup and refer to that from a regular web page

Now you are introducing a completely new issue, which is the lack of a standard media type and file extension - both of which any OWP client would need to support in order for this to be handled as the result of a URL.

> streaming, MIME parts, and index offsets
> Link: cid:f47ac10b-58cc-4372-a567-0e02b2c3d479; rel=index; offset=12014/2048

As @bmeck mentioned, MIME boundary strings have issues when used for packaging. In addition, putting an offset to the index in the header is impossible in a streamed format, since you don't know how large the data itself is - and thus where the index will be - when you start.

> So the proposed format is trying to address both!

It can't - at least not in a mixed model. If you are streaming out and streaming in, there won't be any index/offsets. Or, if you are creating an "offline" package up front, then sure, it works. But what won't work is streaming out a package that will also work offline - or passing an existing offline package to a streaming recipient.

@bmeck
Collaborator

bmeck commented Dec 29, 2016

Note: my PR does not address not knowing the indexes during streaming; your server must place the Content Index at the end of the stream and record offsets for each content resource as it goes. I find this an acceptable compromise, since a client is still allowed to re-order content within the package.
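
Roughly like this sketch (with made-up framing; the PR's actual format may differ): the writer records each part's offset as it streams it out, then appends the Content Index plus a fixed-size pointer to it at the very end:

```python
import io
import json

def write_package(out, resources):
    index = {}
    for location, body in resources.items():
        index[location] = {"offset": out.tell(), "length": len(body)}
        out.write(body)  # each part streams out immediately
    index_offset = out.tell()
    out.write(json.dumps(index).encode())  # Content Index, appended last
    out.write(index_offset.to_bytes(8, "big"))  # trailing pointer to the index

buf = io.BytesIO()
write_package(buf, {"/a.html": b"<html>a</html>", "/b.css": b"body {}"})
```

A local consumer can then read the trailing 8 bytes, seek to the index, and random-access individual parts, which still serves the 'local file' case.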

@jyasskin
Member

I think "streaming" means at least 2 things here, and we need to distinguish them. Here are 3 scenarios with signed packages that might help structure the discussion. I'm not treating unsigned packages here because a client can rewrite them arbitrarily to make the boundaries and offsets work.

3 actors: a server that can sign content with a private key, a client who connects to the server and trusts the key but can't sign with it, and a peer who connects to the client and also trusts the key.

  1. The server needs to dynamically generate at least one of the resources in the package, so it doesn't know all contents and offsets ahead of time. However, the client can benefit from loading resources as they're sent.
    If the package is being transferred without authentication+integrity (e.g. HTTP instead of HTTPS), the client is out of luck and has to wait until the signatures arrive to use any of the package. We get no streaming benefit.
    If the package is being transferred under TLS, the client can use that to infer trust in the content, and the server can send a signed manifest/index after all content is transferred. If there's some way to mark what's going to be in the package so the client knows to save it literally, the server can even transfer the files in multiple separate streams, and the client can assemble the package when the index arrives. Dynamically-generated resources can be transferred with chunked encoding, but we could have the client rewrite that to MIME boundaries before serializing the package if we want, since the offsets aren't generated and signed until the end of the transfer.
  2. One client wants to send the package to a peer, who will rely on the signature to trust the content, but who also might want to use the initial resources before the whole file has transferred.
    This requires the signature block to be sent first, the opposite of case (1). As @bmeck mentioned, it's also possible to sign each resource independently, at the cost of more public-key operations. Nothing's being dynamically generated here, since the client couldn't sign dynamically generated content anyway, so MIME boundaries and fixed offsets are fine, although other ways of marking file boundaries are fine too. The peer can't use a file until the whole file is transferred, hashed, and verified, but they can use file1 before file2 has transferred (see the sketch after this list).

  3. The client or peer wants to re-use a package that's fully transferred, without needing to parse the whole thing from the beginning. This requires an index with offsets and sizes, and signed hashes. The index can be anywhere in the file as long as the peer can find it.
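
A sketch of the verification flow in case (2), assuming a signed index of per-file hashes is sent first (names and shapes invented for illustration):

```python
import hashlib

signed_index = {  # assume the signature over this index already checked out
    "file1": hashlib.sha256(b"contents of file1").hexdigest(),
    "file2": hashlib.sha256(b"contents of file2").hexdigest(),
}

def on_file_complete(name, body):
    """Called once a file's bytes have fully arrived."""
    if hashlib.sha256(body).hexdigest() != signed_index[name]:
        raise ValueError(f"{name}: hash mismatch, discard")
    print(f"{name}: verified, usable now")

on_file_complete("file1", b"contents of file1")
# file2 may still be in flight, but file1 is already usable
on_file_complete("file2", b"contents of file2")
```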

What have I missed?

@lrosenthol
Author

@jyasskin I am not sure those are the only three, but let's work through those.

1 - I would call this "streamed generation", and it is a possible use case, though I would consider it the least important. But regardless, let's work through it.
Your suggestion to leverage the trust of TLS for this model is interesting, but not viable, because the type of trust one gets from TLS is not comparable to (or replaceable by) the trust inherent in signed content. Just because badactor.com has a good TLS cert doesn't mean that I trust the JavaScript code that comes from it. So a client that is concerned with trusted content (which, hopefully, will be more and more of them as we solve this problem) wouldn't be able to use anything until all the content and the certs have arrived and the trust of those certs can be verified.

2 - I am not sure why this has to be client->peer. To me, this is server->client, where the content already exists (with or without a signature). I agree that the certificate (and any associated trust chain and/or revocation info) has to be sent first, so that trust can be established. However, that doesn't require the signed hash to be sent before the end. The client can validate the trust of the cert while streaming in the rest of the data (and the hash); then, only if it ends up trusting the cert, does it even bother checking the hash and (potentially) using the data.
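
A sketch of that pipelining (with a stub standing in for real chain/revocation checks): the cert validation runs while the data streams in and is hashed incrementally, and the hash is only compared if the cert turns out to be trusted:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def validate_cert_chain(cert_blob):
    return cert_blob == b"trusted-cert"  # stub for chain + revocation checks

def receive_package(cert_blob, data_chunks, signed_hash):
    with ThreadPoolExecutor() as pool:
        trust = pool.submit(validate_cert_chain, cert_blob)  # overlaps the download
        h = hashlib.sha256()
        body = b""
        for chunk in data_chunks:  # data (and hash) keep streaming in meanwhile
            h.update(chunk)
            body += chunk
        if not trust.result():
            return None  # untrusted cert: don't even bother checking the hash
        return body if h.hexdigest() == signed_hash else None

digest = hashlib.sha256(b"part1part2").hexdigest()
assert receive_package(b"trusted-cert", [b"part1", b"part2"], digest) == b"part1part2"
```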

@jyasskin
Member

  1. If you're not concerned with "streamed generation", then I don't understand your comment at the top of this thread that "You cannot create it entirely in stream". Sorry for being dense.
    I'm also not clear on the difference between TLS's trust and the trust in signed content. The two differences I know of are: A) TLS is repudiable since the signature only covers the shared symmetric key, not the content, and B) the TLS private key must be accessible in real time from the server, while signed content can protect the key better. If those are the only differences, I don't know why you'd trust a package signed by foo.com but wouldn't trust a TLS connection to foo.com. Could you enlighten me?

  2. Agreed: I called it "client" because the key difference is that it doesn't have access to the private key. It could easily be a server that serves a pre-signed package. My point about sending signatures early is that you can't use a file before you've received its signature. If you send [certificates, file1, file2, ... fileN, signatures], you've unnecessarily delayed use of file1. You can definitely send [certificates, file1, sig1, file2, sig2, ...] instead of [certificates, signatures, file1, file2, ...], but if the package is pre-built I don't really see the benefit.

@lrosenthol
Author

I did make that comment up front because, at the time, I wasn't sure whether that was (or wasn't) a requirement. I would state that, right now, no one has proposed a use case where streaming creation is required.

As you note, you can't use a TLS cert to sign content - so any trust I have in that cert wouldn't apply to the content. I would have to trust a different cert, and that other cert isn't tied to a domain but instead to an organization or individual. See #16 for previous conversations in this area.

@bmeck
Collaborator

bmeck commented Jan 27, 2017

Dropping the streaming requirement seems fine to me. The only use case we have is speeding up sending packages to registries when publishing.

@bmeck
Collaborator

bmeck commented Mar 29, 2017

@jyasskin @lrosenthol @dimich-g are we fine to close this?

@dimich-g
Collaborator

Closing. File a new issue if needed.
