Compression dictionary transport

What is this?

This explainer outlines the benefits of compression dictionaries, details the different use cases for them, and then proposes a way to deliver such dictionaries to browsers to enable these use cases.

The HTTP headers and negotiation are specified in the IETF Draft document for Compression Dictionary Transport.

Summary

This proposal adds support for using designated previous responses as an external dictionary for HTTP responses for compression schemes that support external dictionaries (e.g. Brotli and Zstandard).

HTTP Content-Encoding is extended with new encoding types and support for allowing responses to be used as dictionaries for future requests. All actual header values and names are still TBD:

  • Server responds to a request for a cacheable resource with a Use-As-Dictionary: <options> response header.
  • The client will store a hash of the uncompressed response and the applicable match URL pattern for the resource with the cached response to identify it as a dictionary.
  • On future requests, the client will match a request against the available dictionary match URL patterns. If multiple patterns are matched, the most-specific match is used. If a dictionary is available for a given request, the client will add an appropriate compression scheme (e.g. br-d for shared brotli) to the Accept-Encoding request header as well as an Available-Dictionary: <sf-binary SHA-256> header with the hash of the best available dictionary. The hash is sent as a Structured Field Byte Sequence (base64-encoded, enclosed by colons). e.g. Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:.
  • If the server has a version of the requested resource compressed with the matching dictionary, it serves the dictionary-compressed response with the applicable Content-Encoding: (e.g. br-d) and Vary: Accept-Encoding,Available-Dictionary.
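
As a rough illustration of the hash format described in the list above, the following sketch computes a SHA-256 hash over an uncompressed payload and wraps it as a Structured Field Byte Sequence. The function name and the sample payload are assumptions made for this example only; they are not part of the proposal.

```python
import base64
import hashlib


def available_dictionary_value(dictionary_bytes: bytes) -> str:
    """Return the SHA-256 hash of the uncompressed dictionary, encoded as a
    Structured Field Byte Sequence (base64 text enclosed by colons)."""
    digest = hashlib.sha256(dictionary_bytes).digest()
    return ":" + base64.b64encode(digest).decode("ascii") + ":"


# Hypothetical dictionary payload, used only for illustration.
header_value = available_dictionary_value(b"console.log('v1');")
print("Available-Dictionary:", header_value)
```

A SHA-256 digest is 32 bytes, so the base64 text is always 44 characters long before the enclosing colons are added.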

For interop reasons, dictionary-based compression is only supported on secure contexts (similar to brotli compression).

There are also some browser-specific features independent of the transport compression:

  • For security and privacy reasons, there are CORS requirements (detailed below) for both the dictionary and compressed resource.
  • To populate a dictionary for future use, a server can respond with a link tag or header that triggers an idle-time fetch of a dictionary, e.g. <link rel=dictionary href=[dictionary_url]>.

Background

What are compression dictionaries?

Compression dictionaries are bits of compressible content known ahead of time. Compression engines use them to reduce the size of compressed content.

Because they are known ahead of time, the compression engine can refer to the content in the dictionary when representing the compressed content, reducing the size of the compressed payload. The decompression engine can then interpret the content based on that pre-defined knowledge.

Taken to the extreme, if the content being compressed is identical to the dictionary, the entire delivered content can be a few bytes referring to the dictionary.
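
The effect can be sketched with Python's standard zlib module, which supports preset dictionaries for DEFLATE. This is a stand-in technique: the proposal targets Brotli and Zstandard, not DEFLATE, and the sample content and the helper names (deflate, inflate) are assumptions made for illustration.

```python
import zlib

# Hypothetical content: "v1" acts as the dictionary, "v2" is v1 plus one line.
v1 = "".join(
    f"function handler_{i}(event) {{ return event.value * {i}; }}\n"
    for i in range(200)
).encode("utf-8")
v2 = v1 + b"function handler_new(event) { return event.value; }\n"


def deflate(data: bytes, dictionary: bytes = b"") -> bytes:
    """Compress with DEFLATE, optionally primed with a preset dictionary."""
    if dictionary:
        compressor = zlib.compressobj(level=9, zdict=dictionary)
    else:
        compressor = zlib.compressobj(level=9)
    return compressor.compress(data) + compressor.flush()


def inflate(data: bytes, dictionary: bytes = b"") -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    if dictionary:
        decompressor = zlib.decompressobj(zdict=dictionary)
    else:
        decompressor = zlib.decompressobj()
    return decompressor.decompress(data) + decompressor.flush()


without_dictionary = deflate(v2)
with_dictionary = deflate(v2, dictionary=v1)

# Sizes in bytes: original, compressed alone, compressed against v1.
print(len(v2), len(without_dictionary), len(with_dictionary))
```

Both sides must hold byte-identical copies of the dictionary: decompressing the dictionary-compressed output without it (or with a different one) fails.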

Now, you may ask, if dictionaries are so awesome, then...

Why aren't browsers already using compression dictionaries?

To some extent, they are. The brotli compression scheme includes a built-in dictionary that was built to work reasonably well for HTML, CSS and JavaScript. Custom (shared) dictionaries have a more complicated history.

At some point, Chrome did support a shared compression dictionary. When Chrome was first released, it supported a dictionary compression method called SDCH (Shared-dictionary Compression over HTTP). That support was unshipped in 2016 due to complexities around the protocol’s implementation, specification and lack of an interoperability story.

SDCH enabled Chrome and Chromium-based browsers to create origin-specific dictionaries that were downloaded once for the origin and enabled multiple pages to be compressed with significantly higher compression ratios. That's one use case for compression dictionaries, which we will call the "Shared dictionary" use case.

There's another major use case for shared dictionaries that was never supported by browsers - delta compression.

That use case would enable the browser to reuse past resources (e.g. your site's main JS v1.2) in order to compress future ones (e.g. main JS v1.3). But traditionally, this use case raised complexities around the ability of the browser to coordinate its cache state with the server and agree on what the dictionary would be. It also raised issues with both sides having to store all past versions of each resource in order to successfully compress and decompress it.

The common thread is that the use of compression dictionaries had run into various complexities over the years which resulted in deployment issues.

This time will be different

A few things about this current proposal are different from past attempts, in ways we're hoping are meaningful:

  • CORS-based restrictions can ensure that public and private resources don't get mixed in ways that can leak user data.
  • Same-origin, path and destination-based matching would help us manage a "single possible dictionary per request" policy, which will minimize client-side cache fan-out.
  • Dictionaries must already be available on the client to be used (fetching of the dictionary is not in the critical path of a resource fetch).
  • Diff-caching on the server can simplify and enable the server-side deployment story.

Use cases

Compression types

There are two primary models for using shared dictionaries that are similar but differ in how the dictionary is fetched:

  • Delta compression - reusing past downloaded resources for compressing future updates of the same or similar resources.
  • Shared dictionary - a dedicated dictionary is downloaded out-of-band, and then used to compress and decompress resources on the page.

In both cases the client advertises the best-available dictionary that it has for a given request. If the server has a delta-compressed version of the resource, compressed with the advertised dictionary, it can just send that delta-compressed diff. It can also use that advertised dictionary (if available) to dynamically compress that resource.

With the Delta compression use case, a previously-downloaded version of the resource is available to use for future requests as a dictionary. For example, with a JavaScript file, v1 of the file may be in the browser's cache and available for use as a dictionary to use when fetching v2 so only the difference between the two needs to be transmitted.

In the Shared dictionary use case, the dictionary is a purpose-built dictionary that is fetched using a <link> tag and can be used for future requests that match the match URL pattern covered by the dictionary. For example, on a first visit to a site, the HTML response references a custom dictionary that should be used for document fetches for that origin. The dictionary is downloaded at some point by the browser and, on future navigations through the site, is advertised as being available for document requests that match the URL pattern that the dictionary applies to.

Risks

Security

The Shared Brotli draft does a good job describing the security risks. In summary:

  • CRIME and BREACH mean that both the resource being compressed and the dictionary itself can be considered readable by the document deploying them. That is Bad™ if any of them contains information that the document cannot already obtain by other means.
  • An out-of-band dictionary needs to be carefully examined to ensure that it wasn’t created using users’ private data, nor using content that’s user controlled.

Privacy

Dictionaries will need to be cached using a triple key (top-level site, nested context site, URL) similar to other cached resources (or any other partitioning scheme that’s good enough for cached resources and cookies from a privacy and security perspective). That’s not an issue for the delta compression use case, but can become a burden fast for the out-of-band dictionaries, as multiple nested contexts may need to download the same dictionary multiple times.

Note: Common payload caching may be useful in such cases.

There’s also the issue of users advertising resource versions in their cache to servers as part of the request. This already has a precedent in cache validators (ETags, If-Modified-Since), so maybe that’s fine, given that the cache is partitioned.

Adverse performance effects

Downloading an out-of-band dictionary means that the site owner is making a certain bet regarding the number of visits that would enable the user to amortize that dictionary’s cost.

At worst, if the user never visits the site again until the dictionary’s lifetime expires, the user has paid the cost of downloading the dictionary with no benefits.

For some large and heavily trafficked sites, that case is rare. For others, it’s extremely common, and we should be wary of both the tools we’d be putting in developers’ hands, as well as the messaging we’re providing them regarding when to use them.

Proposal

Static resources flow

In this flow, we’re reusing static resources themselves as dictionaries that would be used to compress future updates of themselves, or similar resources.

  • example.com downloads example.com/large-module.wasm for the first time.
  • The response for example.com/large-module.wasm contains a Use-As-Dictionary: <options> response header. The options are a structured field dictionary that includes the ability to set a URL-matching pattern, matching fetch destination, and an opaque identifier. More details are in the "Dictionary options header" section below.
  • The client saves the URL pattern, destination (if provided), ID and a SHA-256 hash of the resource with the cached resource.
    • For browser clients, the response must also be non-opaque in order to be used as a dictionary. Practically, this means the response is either same-origin as the document or is a cross-origin request with an Access-Control-Allow-Origin: response header that makes the response readable by the document.
  • The next time the browser fetches a resource from a URL that matches a pattern covered by a dictionary in cache and with a fetch destination that matches the provided destination, it includes an Available-Dictionary: request header, which lists a single hash (encoded as a Structured Field Byte Sequence).
    • The request is limited to specifying a single dictionary hash both to reduce the header overhead and limit the cardinality of the Available-Dictionary: request header (to limit variations in the Vary caches).
    • If there is an ID associated with the dictionary then it is sent in a separate Dictionary-ID request header.
    • Any new resource stored as a dictionary with the same URL-matching pattern replaces older ones. When sending requests, the browser uses the most specific match for the request to select its dictionary. Specificity is determined by the string length of the match pattern specified with the dictionary.
  • When the server gets a request with the Available-Dictionary header in it:
    • If the client sent a sec-fetch-mode: cors request header then the dictionary should be ignored unless the response will have an Access-Control-Allow-Origin: response header that includes the origin of the page the request was issued from (* or matched against the origin: or referer:).
    • The server can simply ignore the dictionary if it doesn't have a diff that corresponds to said dictionary. In that case the server can serve the response without delta compression.
    • If the server does have a corresponding diff, it can respond with that, indicating that as part of its Content-Encoding header as well as a Content-Dictionary response header with the hash of the dictionary that was used (must match the hash from the Available-Dictionary request header).
      • For example, if we're using shared brotli compression, a request carrying Accept-Encoding: deflate, gzip, br, br-d would receive a response with Content-Encoding: br-d.
  • If the browser advertised a dictionary but then fails to load it from its cache, and the server used that dictionary, the resource request should fail.
  • For browser clients, the response must be non-opaque in order to be decompressed with a shared dictionary. Practically, this means the response is either same-origin as the document or is a cross-origin request with an Access-Control-Allow-Origin: response header that makes the response readable by the document.
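
The server-side part of this flow can be sketched as a lookup keyed on the request path and the advertised dictionary hash. The DIFFS table, the select_response function and the placeholder payloads are hypothetical names chosen for this example; the CORS check from the list above is left out to keep the sketch short.

```python
from typing import Dict, Mapping, Tuple

EXAMPLE_HASH = ":pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:"

# Hypothetical store of pre-built diffs: (path, dictionary hash) -> delta bytes.
DIFFS: Dict[Tuple[str, str], bytes] = {
    ("/large-module.wasm", EXAMPLE_HASH): b"<delta-compressed bytes>",
}


def select_response(
    path: str, request_headers: Mapping[str, str], full_body: bytes
) -> Tuple[bytes, Dict[str, str]]:
    """Return (body, response headers) for a request that may advertise a dictionary."""
    headers = {name.lower(): value.strip() for name, value in request_headers.items()}
    response_headers = {"Vary": "Accept-Encoding,Available-Dictionary"}

    # Parse Accept-Encoding into bare coding names, ignoring any q-values.
    accepted = {
        item.split(";")[0].strip()
        for item in headers.get("accept-encoding", "").split(",")
    }
    dictionary_hash = headers.get("available-dictionary")

    if dictionary_hash and "br-d" in accepted:
        diff = DIFFS.get((path, dictionary_hash))
        if diff is not None:
            response_headers["Content-Encoding"] = "br-d"
            # Must echo the hash from the Available-Dictionary request header.
            response_headers["Content-Dictionary"] = dictionary_hash
            return diff, response_headers

    # No matching diff: serve the resource without delta compression.
    return full_body, response_headers


body, sent_headers = select_response(
    "/large-module.wasm",
    {"Accept-Encoding": "deflate, gzip, br, br-d", "Available-Dictionary": EXAMPLE_HASH},
    b"<full module bytes>",
)
```

Ignoring an unknown dictionary hash is always safe here, because the fallback response does not depend on the dictionary at all.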

Dynamic resources flow

  • Shared dictionary is declared ahead of time and then downloaded out of band using a Link: header on the document response or <link> HTML tag with a rel=dictionary type.
    • The dictionary resource will be downloaded as a CORS request with credentials mode “omit” to discourage including user-specific private data in the dictionary, since its data will be readable without credentials.
    • It will be downloaded with “idle” priority, once the site is actually idle.
    • Browsers may decide to not download it when they suspect that the user is paying for bandwidth, or when used by sites that are not likely to amortize the dictionary costs (e.g. sites that the user isn’t visiting frequently enough).
    • Browsers may decide to not use a shared dictionary if it contains hints that its contents are not public (e.g. Cache-Control: private headers).
  • The dictionary response must include the Use-As-Dictionary: <options> header, appropriate cache lifetime headers and will be used for future requests using the same process as the Static resources flow.
    • For browser clients, the response must also be non-opaque in order to be used as a dictionary. Practically, this means the response is either same-origin as the document or is a cross-origin request with an Access-Control-Allow-Origin: response header that makes the response readable by the document.

Dictionary options header

The Use-As-Dictionary: response header is a structured field dictionary that allows for setting multiple options and for future expansion. The supported options and defaults are:

  • match - URL-matching pattern for the dictionary to apply to. Required. This is the patternString argument to the URLPattern constructor, URLPattern(patternString, baseURL), where the baseURL is the URL of the request and support for regexp tokens is disabled. URLPattern allows for absolute or relative URLs. e.g. /app1/main* will match https://www.example.com/app1/main_12345.js and main* in response to https://www.example.com/app1/main_1.js will match https://www.example.com/app1/main.xyz.js. Dictionaries will only match requests from the same origin as the dictionary.
  • match-dest - An optional Structured Field Inner List of string values of matching request destinations. The default value is an empty list (()) which will match all request destinations.
  • id - An optional server-provided dictionary ID string. The string is opaque to the client and echoed back to the server in a Dictionary-ID request header when the dictionary matches an outbound request. The default value is an empty string ("").

For example: use-as-dictionary: match="/app1/main*", match-dest=("script"), id="xxx" would specify matching on a path prefix of /app1/main for script requests and to send Dictionary-ID: "xxx" for any requests that match the dictionary.
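
To make the interaction of these options concrete, here is a minimal sketch of client-side dictionary selection. It is a simplification under stated assumptions: real clients use URLPattern, whereas this sketch resolves the pattern against the dictionary's URL and treats only "*" as a wildcard over the whole URL string. StoredDictionary and best_dictionary are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import urljoin, urlsplit


@dataclass
class StoredDictionary:
    source_url: str                    # URL the dictionary response came from
    match: str                         # "match" option as sent by the server
    match_dest: Tuple[str, ...] = ()   # empty tuple matches every destination
    dict_id: str = ""                  # "id" option, echoed in Dictionary-ID


def _origin(url: str) -> Tuple[str, str]:
    parts = urlsplit(url)
    return (parts.scheme, parts.netloc)


def _matches(pattern_url: str, request_url: str) -> bool:
    # Simplified stand-in for URLPattern: only "*" acts as a wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern_url.split("*"))
    return re.fullmatch(regex, request_url) is not None


def best_dictionary(
    request_url: str, destination: str, stored: List[StoredDictionary]
) -> Optional[StoredDictionary]:
    """Pick the matching dictionary with the longest match pattern, if any."""
    candidates = []
    for entry in stored:
        if _origin(entry.source_url) != _origin(request_url):
            continue  # dictionaries only match same-origin requests
        if entry.match_dest and destination not in entry.match_dest:
            continue
        # Resolve relative patterns against the dictionary's own URL.
        if _matches(urljoin(entry.source_url, entry.match), request_url):
            candidates.append(entry)
    return max(candidates, key=lambda entry: len(entry.match), default=None)


stored = [
    StoredDictionary("https://www.example.com/app1/main_1.js", "/app1/main*", ("script",), "xxx"),
    StoredDictionary("https://www.example.com/app1/main_1.js", "/app1/*"),
]
chosen = best_dictionary("https://www.example.com/app1/main_12345.js", "script", stored)
```

With the two stored entries above, a script request for /app1/main_12345.js selects the entry with match="/app1/main*" because its pattern string is longer, so the client would also send Dictionary-ID: "xxx".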

Compression algorithms

The dictionary negotiation is independent of the compression algorithm that is used for compressing the HTTP response and is designed to support any compression scheme that supports using external compression dictionaries. Currently that includes Brotli and Zstandard but it is not limited to those (and depends on what the client and server both support). It is likely that, in the future, content-specific compression schemes that handle delta-compression better may be built (e.g. code-aware Wasm compression).

The compression algorithm negotiation uses the regular Accept-Encoding:/Content-Encoding: negotiation that is used for non-dictionary compression. It is important that new names are registered with the HTTP Content Coding Registry for algorithms that use an external dictionary to prevent situations where processing along the request flow may attempt to decode a response using just the algorithm without being dictionary-aware. That way, if anything in the request flow needs to operate on the decoded content, it can either be made aware of the dictionary-based compression or it can modify the Accept-Encoding: request header to only support schemes that it is aware of (already common practice).

The examples in this document will use br-d for dictionary-based Brotli compression but the actual algorithm(s) negotiated could be anything that the client supports.

Compression API

The compression API can also expose support for using caller-supplied dictionaries but that is out-of-scope for this proposal.

Websockets

Websocket support is out-of-scope for this proposal but there is nothing in the current dictionary negotiation that precludes websockets from being able to build dictionary-based compression (either by leveraging parts of what is provided here or building something separate).

Security and Privacy

Dictionary and Resource readability (CORS)

Since the contents of the dictionary and compressed resource are both effectively readable through side-channel attacks, this proposal makes it explicit and requires that both be CORS-readable from the document origin. The origin for the URL the dictionary was served from and the origin of the match pattern for URLs MUST be the same (i.e. the dictionary and compressed resource must both be from the same origin).

For dictionaries and resources that are same-origin as the document, no additional requirements exist as both are CORS-readable from the document context. For navigation requests, their resource is by definition same-origin as the document their response will eventually commit. As a result, the dictionaries that match their URL pattern are similarly same-origin.

Dictionaries and resources served from a different origin than the document must be CORS-readable from the document origin, e.g. Access-Control-Allow-Origin: <document origin or *>. This means that any cross-origin content that is fetched in no-cors mode by default must enable CORS-fetching (usually with the crossorigin attribute).

When sending a CORS request with an available dictionary, a browser should only include the Available-Dictionary: header if it is also sending the sec-fetch-mode: header so a CORS-readable decision can be made on the server before responding.

In order to prevent sending dictionary-compressed responses that the client will not be able to process, when a server receives a request with sec-fetch-mode: cors as well as an Available-Dictionary: header, it should only use the dictionary if the response includes an Access-Control-Allow-Origin: response header that covers the origin of the page the request was made from, either because it is Access-Control-Allow-Origin: * (covering all origins) or because it names the origin given in the origin: or referer: request header. If there is no origin: or referer: request header and Access-Control-Allow-Origin: is not * then the dictionary should not be used.
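
The server-side rule in the paragraph above can be sketched as a small predicate. The function name may_use_dictionary and its parameters are assumptions for illustration; allow_origin stands for the Access-Control-Allow-Origin value the server intends to send (None if it will not send one).

```python
from typing import Mapping, Optional
from urllib.parse import urlsplit


def _origin_of(url: str) -> Optional[str]:
    """Extract scheme://host[:port] from a URL such as a referer value."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None
    return f"{parts.scheme}://{parts.netloc}"


def may_use_dictionary(
    request_headers: Mapping[str, str], allow_origin: Optional[str]
) -> bool:
    """Decide whether the advertised dictionary may be used for this response."""
    headers = {name.lower(): value.strip() for name, value in request_headers.items()}
    if "available-dictionary" not in headers:
        return False
    if headers.get("sec-fetch-mode") != "cors":
        # The restriction described above only applies to CORS requests.
        return True
    if allow_origin is None:
        return False
    if allow_origin.strip() == "*":
        return True
    # Prefer the origin header and fall back to the origin of the referer.
    request_origin = headers.get("origin") or _origin_of(headers.get("referer", ""))
    return request_origin is not None and request_origin == allow_origin.strip()


cors_request = {
    "Sec-Fetch-Mode": "cors",
    "Available-Dictionary": ":pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:",
    "Origin": "https://www.example.com",
}
allowed = may_use_dictionary(cors_request, "https://www.example.com")
```

When the predicate returns False the server can still answer the request; it simply falls back to a response that is not dictionary-compressed.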

To discourage encoding user-specific private information into the dictionaries, any out-of-band dictionaries fetched using a <link> will be uncredentialed fetches.

These protections against compressing opaque resources make CORB and ORB considerations unnecessary as they are specific to protecting opaque resources.

Fingerprinting

The existence of a dictionary is effectively a cookie for any requests that match it and should be treated as such:

  • Storage partitioning for dictionary resource metadata should be at least as restrictive as for cookies.
  • Dictionary entries (or at least the metadata) should be cleared any time cookies are cleared.

The existence of support for dictionary-based Accept-Encoding: has the potential to leak client state information if not applied consistently. If the browser supports dictionary-based compression algorithms then that support should always be advertised, independent of the current state of the feature. Specifically, this means that in any private browsing mode (Incognito in Chrome), dictionary-based algorithm support should still be advertised even if the dictionaries will not persist so that the state of the private browsing mode is not exposed.

Triggering dictionary fetches

The explicit fetching of a dictionary through a <link rel=dictionary> tag or Link: header is functionally equivalent to <link rel=preload> with different priority and should be treated as such. This means that the Link: header is only effective for document navigation responses and cannot be used for subresource loads.

This prevents passive resources, like images, from using the dictionary fetch as a side-channel for sending information.

Cache/CDN considerations

Any caches between the server and the client will need to be able to support Vary on both Accept-Encoding and Available-Dictionary, otherwise the responses will be either corrupt (in the case of serving a dictionary-compressed resource with the wrong dictionary) or ineffective (serving a non-dictionary-compressed resource when dictionary compression was possible).
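
A minimal sketch of what supporting Vary on both headers means for an intermediary cache: the cache key has to include the values of both request headers, so that a response compressed against one dictionary is never served to a client advertising another. The cache_key function and the in-memory dict are hypothetical and ignore header normalization that real caches perform.

```python
from typing import Dict, Mapping, Tuple

# Request headers named in the Vary response header.
VARY_HEADERS = ("accept-encoding", "available-dictionary")


def cache_key(url: str, request_headers: Mapping[str, str]) -> Tuple[str, ...]:
    """Build a cache key from the URL plus every header listed in Vary."""
    headers = {name.lower(): value.strip() for name, value in request_headers.items()}
    return (url,) + tuple(headers.get(name, "") for name in VARY_HEADERS)


cache: Dict[Tuple[str, ...], bytes] = {}

first_request = {
    "Accept-Encoding": "br-d,br,gzip",
    "Available-Dictionary": ":pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:",
}
# Same URL and encodings, but no dictionary is advertised.
second_request = {"Accept-Encoding": "br-d,br,gzip"}

url = "https://static.example.com/app/main.js/125"
cache[cache_key(url, first_request)] = b"<br-d response for that dictionary>"

hit = cache.get(cache_key(url, first_request))
miss = cache.get(cache_key(url, second_request))
```

The second lookup misses because the Available-Dictionary value differs, which is the behavior that keeps a dictionary-compressed body away from clients that cannot decode it.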

Any middle-boxes in the request flow will also need to support the dictionary-compressed content-encoding, either by passing it through unmodified or by managing the appropriate dictionaries and compressed resources.

Examples

Bundled JavaScript on separate origin

In this example, www.example.com will use a bundle of application JavaScript that they serve from a separate static domain (static.example.com). The JavaScript files are versioned and have a long cache time, with the URL changing when a new version of the code is shipped.

On the initial visit to the site:

  • The browser loads https://www.example.com/ which contains <script src="//static.example.com/app/main.js/123" crossorigin> (where 123 is the build number of the code).
  • The browser requests https://static.example.com/app/main.js/123 with Accept-Encoding: br-d,br,gzip.
  • The server for static.example.com responds with the file as well as Use-As-Dictionary: match="/app/main.js*", Access-Control-Allow-Origin: https://www.example.com and Vary: Accept-Encoding,Available-Dictionary.
  • The browser caches the js file along with a SHA-256 hash of the decompressed file and the https://static.example.com/app/main.js* URL pattern (the match value is resolved against the URL of the request, which is on static.example.com).
sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<script src="//static.example.com/app/main.js/123" crossorigin>...
Browser->>static.example.com: GET /app/main.js/123<br/>Accept-Encoding: br-d,br,gzip
static.example.com->>Browser: Use-As-Dictionary: match="/app/main.js*"<br/>Access-Control-Allow-Origin: https://www.example.com<br/>Vary: Accept-Encoding,Available-Dictionary

At build time, the site developer creates delta-compressed versions of main.js using previous builds as dictionaries, storing the delta-compressed version along with the SHA-256 hash of the dictionary used (e.g. as main.js.<hash>.br-d).

On a future visit to the site after the application code has changed:

  • The browser loads https://www.example.com/ which contains <script src="//static.example.com/app/main.js/125" crossorigin>.
  • The browser matches the https://static.example.com/app/main.js/125 request with the https://static.example.com/app/main.js* URL pattern of the previous dictionary response that is in cache and requests https://static.example.com/app/main.js/125 with Accept-Encoding: br-d,br,gzip, sec-fetch-mode: cors and Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:. For this example, the server would need to re-encode the hash value from the header as a filesystem-safe version of the hash before looking for the file (base64-decode the header value and then hex-encode the hash).
  • The server for static.example.com matches the URL and hash with the pre-compressed artifact from the build and responds with it and Content-Encoding: br-d, Access-Control-Allow-Origin: https://www.example.com, Vary: Accept-Encoding,Available-Dictionary, and Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=: response headers.

It could have also included a new Use-As-Dictionary: match="/app/main.js*" response header to have the new version of the file replace the old one as the dictionary to use for future requests for the path but that is not a requirement for the existing dictionary to have been used.

sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<script src="//static.example.com/app/main.js/125" crossorigin>...
Browser->>static.example.com: GET /app/main.js/125<br/>Accept-Encoding: br-d,br,gzip<br/>sec-fetch-mode: cors<br/>Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:
static.example.com->>Browser: Content-Encoding: br-d<br/>Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:<br/>Access-Control-Allow-Origin: https://www.example.com<br/>Vary: Accept-Encoding,Available-Dictionary
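
The re-encoding step mentioned in this example (base64-decode the header value, then hex-encode the hash) can be sketched as follows. The function names are hypothetical, and the artifact naming simply follows the main.js.<hash>.br-d convention used in this example.

```python
import base64
import binascii
from typing import Optional


def dictionary_hash_hex(available_dictionary: str) -> Optional[str]:
    """Convert an Available-Dictionary value (":<base64>:") to a hex string.

    Returns None if the value is not a well-formed 32-byte hash.
    """
    value = available_dictionary.strip()
    if len(value) < 2 or not (value.startswith(":") and value.endswith(":")):
        return None
    try:
        digest = base64.b64decode(value[1:-1], validate=True)
    except (binascii.Error, ValueError):
        return None
    if len(digest) != 32:  # a SHA-256 digest is always 32 bytes
        return None
    return digest.hex()


def artifact_name(resource: str, available_dictionary: str) -> Optional[str]:
    """Name of the pre-compressed file for this resource and dictionary."""
    hex_hash = dictionary_hash_hex(available_dictionary)
    if hex_hash is None:
        return None
    return f"{resource}.{hex_hash}.br-d"


name = artifact_name("main.js", ":pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:")
print(name)
```

Validating the header before building a file name matters because the value is client-controlled; rejecting anything that is not exactly a 32-byte hash keeps arbitrary input out of file system paths.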

Site-specific dictionary used for all document navigations in a part of the site

In this example, www.example.com has a custom-built dictionary that should be used for all navigation requests to /product.

On the initial visit to the site:

  • The browser loads https://www.example.com/ which contains <link rel=dictionary href="/dictionaries/product_v1.dat">.
  • At an idle time, the browser sends an uncredentialed fetch request for https://www.example.com/dictionaries/product_v1.dat.
  • The server for www.example.com responds with the dictionary contents as well as use-as-dictionary: match="/product/*", match-dest=("document"), id="product_v1" and appropriate caching headers.
  • The browser caches the dictionary file along with a SHA-256 hash of the decompressed file and the https://www.example.com/product/* URL pattern, the document destination and the product_v1 dictionary ID.
sequenceDiagram
Browser->>www.example.com: GET /
www.example.com->>Browser: ...<link rel=dictionary href="/dictionaries/product_v1.dat">...
Browser->>www.example.com: GET /dictionaries/product_v1.dat<br/>Accept-Encoding: br-d,br,gzip
www.example.com->>Browser: use-as-dictionary: match="/product/*", match-dest=("document"), id="product_v1"

At some point after the dictionary has been fetched, the user clicks on a link to https://www.example.com/product/myproduct:

  • The browser matches the /product/myproduct request with the https://www.example.com/product/* URL pattern of the previous dictionary response as well as the document request destination and requests https://www.example.com/product/myproduct with Accept-Encoding: br-d,br,gzip, Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=: and Dictionary-ID: "product_v1" request headers.
  • The server supports dynamically compressing responses using available dictionaries and has the dictionary with the same ID and hash available and responds with a brotli-compressed version of the response using the specified dictionary as well as Content-Encoding: br-d and Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=: response headers.
sequenceDiagram
Browser->>www.example.com: GET /product/myproduct<br/>Accept-Encoding: br-d,br,gzip<br/>Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:<br/>Dictionary-ID: "product_v1"
www.example.com->>Browser: Content-Encoding: br-d<br/>Content-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:

Changelog

These are the changes that have been made to the specs as they have progressed through various standards organizations and based on developer feedback during browser experiments.

Feb 2023

  • The Sec-Available-Dictionary request header changed to Available-Dictionary.
  • The value of the Available-Dictionary request header changed to be a Structured Field Byte Sequence (base-64 encoding of the dictionary hash, surrounded by colons) instead of hex-encoded string.
  • The content encoding string for brotli with a dictionary changed from sbr to br-d.
  • The match field of the Use-As-Dictionary response header is now a URLPattern.
  • The expiration of the dictionary now uses the cache expiration of the dictionary resource instead of a separate expires.
  • The server can provide an id in the Use-As-Dictionary response header which is echoed in the Dictionary-ID request header by the client in future requests.
  • The server needs to send a Content-Dictionary response header with the hash of the dictionary used when compressing a response with a dictionary (must match the Available-Dictionary from the request).
  • match-dest was added to the Use-As-Dictionary response header to allow for matching on fetch destinations (e.g. match-dest=("document") to have the dictionary only be used for document requests).