This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

Design Doc: Parsing and Rewriting Framework

Jeff Kaufman edited this page Jan 5, 2017 · 2 revisions

Parsing and Rewriting Framework

Joshua Marantz, November 2010

Apache Environment

The rewriter will live in Apache as an output filter. Apache filters operate on a stream of bytes. While it is feasible for a filter to collect the entire HTML file before operating on it and passing it through to the next filter, this can harm the user experience by preventing a requested page from displaying until it has been fully generated. It is possible for dynamically generated pages to present a layout first, and fill in details as they become available. Collecting the entire output in a filter before sending it downstream defeats that experience. Only the site owner can determine whether streaming is important, and to what degree.

The challenge of streaming is that if a rewriting module has already passed through some of the HTML on a page, it's too late to rewrite it based on markup seen later on. This can limit our ability to fully sprite all the images on a page. However, it does not prevent us from spriting within the chunk of markup that is transferred into the filter, and it does not hamper single-image optimization at all.

There is an inescapable tradeoff between rewriting and streaming of HTML content. Rewriting will be most effective when the entire DOM can be analyzed and updated. On the other hand, a very long page can be painted in the browser before it's fully generated by the server, so buffering it completely will increase the time-to-first-paint.

In particular, spriting of images, CSS, and JavaScript can only be performed on content that is buffered. That does not mean that streaming defeats spriting; it just limits its scope. However, leaf-based optimizations can be performed even in a highly streamed environment. Such optimizations include:

  • Extending the cache lifetime of resources by signing them
  • Optimizing images based on their dimensions
  • Minifying JS, CSS, and HTML.
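To make the leaf-based idea concrete, here is a minimal sketch of one such optimization: collapsing whitespace runs in an HTML text node. The function name is illustrative, not from the rewriter itself, and a real minifier would also have to preserve the contents of `<pre>`, `<script>`, and `<textarea>`, which this sketch ignores. The point is that it needs only the text in hand, not the rest of the page.

```cpp
#include <cctype>
#include <string>

// Illustrative leaf optimization: collapse each run of whitespace in an
// HTML text node to a single space, and drop leading/trailing runs.
// Operates on one token in isolation -- no buffering of the page needed.
std::string CollapseWhitespace(const std::string& text) {
  std::string out;
  bool pending_space = false;
  for (char c : text) {
    if (std::isspace(static_cast<unsigned char>(c))) {
      pending_space = true;  // defer emitting until we see a non-space
    } else {
      if (pending_space && !out.empty()) out += ' ';
      pending_space = false;
      out += c;
    }
  }
  return out;
}
```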

The goal of our C++ rewriter is to do the best job it can rewriting the HTML in the chunks provided. The Apache configuration can help determine how much content is buffered before rewriting, and thus where the site lands on the tradeoff between rewrite effectiveness and streaming.

SAX vs DOM Parsing

XML parsers come in two varieties. DOM parsers read in the entire file and then provide an API into the tree structure. SAX parsers provide an event-driven model where the parser calls user-supplied callbacks to handle constructs as they are seen by the parser. We are going to use a SAX-based parser called libxml2. While this is widely known as GNOME's resident XML parser, it has a robust HTML parsing interface as well. We are in the process of mechanically validating its correctness on the top web sites, and have visually validated it on MSN, NYTimes, Yahoo, Google, HuffingtonPost, and CNN. It achieves this through a flexible and (evidently) well-tuned recovery mechanism for mismatched tags.

SAX enables us to provide a streaming interface to the HTML content, in a form where it is easy to rewrite tags in isolation, or in context with other tags that fall within the same flush-window.

SAX does not preclude us from parsing the entire document so that our rewrites are maximally effective. This will be under the control of the Apache administrator, via module selection or configuration. It is easy to build an efficient DOM using a SAX parser, but it is not possible to provide a responsive event-driven interface using a DOM parser.
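The claim that a DOM is easy to build on top of SAX can be sketched with a stack of open elements: each begin-element callback pushes a new node, each end-element pops. The `Node` and `DomBuilder` names below are illustrative stand-ins, not classes from the rewriter, and real HTML tree construction must also recover from mismatched tags as libxml2 does.

```cpp
#include <memory>
#include <string>
#include <vector>

// Minimal DOM node: an element (tag set) or a text node (text set).
struct Node {
  std::string tag;   // empty for text nodes
  std::string text;  // empty for elements
  std::vector<std::unique_ptr<Node>> children;
};

// Builds a tree from SAX callbacks using a stack of open elements.
class DomBuilder {
 public:
  DomBuilder() : root_(new Node{"#document"}) {
    stack_.push_back(root_.get());
  }

  void StartElement(const std::string& tag) {
    auto node = std::make_unique<Node>();
    node->tag = tag;
    Node* raw = node.get();
    stack_.back()->children.push_back(std::move(node));
    stack_.push_back(raw);  // this element is now open
  }

  void EndElement() {
    if (stack_.size() > 1) stack_.pop_back();  // never pop the root
  }

  void Characters(const std::string& text) {
    auto node = std::make_unique<Node>();
    node->text = text;
    stack_.back()->children.push_back(std::move(node));
  }

  const Node* root() const { return root_.get(); }

 private:
  std::unique_ptr<Node> root_;
  std::vector<Node*> stack_;  // open elements, document root at the bottom
};
```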

HTML Rewriting using SAX

Using the parsing technology described above, we establish a rewriting technology based on filter chains. Each rewriter is a filter that processes a stream of SAX events (HtmlEvent), interspersed with Flushes. The filter operates on the sequence of events between flushes (a flush-window), and the system passes the (possibly mutated) event-stream to the next filter.

An HtmlEvent is a lexical token provided by the parser, including:

  • begin document
  • end document
  • begin element
  • end element
  • whitespace
  • characters
  • cdata
  • comment

To implement this, we retain the sequence of events as a data structure: list<HtmlEvent>. HtmlEvents are sent to filters (HtmlFilter), as follows:

foreach filter in filter-chain
  foreach event in flush-window
    apply filter to event
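The double loop above can be sketched in C++. The `HtmlEvent` and `HtmlFilter` types here are simplified stand-ins for the real classes (an event is reduced to a single mutable token string), but the structure is the same: every filter visits every event in the flush-window, and the mutated stream is what the next filter sees.

```cpp
#include <string>
#include <vector>

// Simplified event: in the real system this would be one of the token
// kinds listed above (begin element, characters, comment, ...).
struct HtmlEvent {
  std::string token;
};

// A filter may rewrite each event in place.
class HtmlFilter {
 public:
  virtual ~HtmlFilter() {}
  virtual void Apply(HtmlEvent* event) = 0;
};

// Run every filter over every event in the current flush-window,
// passing the (possibly mutated) stream from one filter to the next.
void ApplyFilters(const std::vector<HtmlFilter*>& chain,
                  std::vector<HtmlEvent>* flush_window) {
  for (HtmlFilter* filter : chain) {
    for (HtmlEvent& event : *flush_window) {
      filter->Apply(&event);
    }
  }
}

// Example filter: appends a marker to each event so the chain's
// ordering is visible (illustrative only).
class AnnotateFilter : public HtmlFilter {
 public:
  explicit AnnotateFilter(const std::string& mark) : mark_(mark) {}
  void Apply(HtmlEvent* event) override { event->token += mark_; }

 private:
  std::string mark_;
};
```

Note the ordering: the first filter sees the whole window before the second filter runs, so later filters observe earlier filters' rewrites.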

Resource Serving

When an HTML filter rewrites a resource (image, JavaScript, CSS) it must ensure that the rewritten resource can be served when requested by the browser. There are four mechanisms for serving resources:

  • Place the rewritten resource in a database, shared file system, or shared cache, where every server can see it.
  • Encode enough information (origin URLs, metadata) for rewriting the resource in the URL.
  • Push the resource to a content delivery network (CDN).
  • Inline the resource as a literal in the rewritten HTML.

Taking these in reverse order:

Inlining: We may choose to serve resources inlined or as discrete resources based on our analysis of the network bandwidth, parallelism, and congestion window, informed by the client's browser (user-agent) and other detection technology we could employ at runtime.

Serving from CDN: Assuming we are serving resources discretely (not inlining them), the preferred mechanism, when available, is to push the resource to a CDN, such as Akamai or Google (GGS). Such an option is dependent on a business relationship between the site owner and the CDN. For sites that already use CDNs to serve their resources, our goal is to continue to use the same CDN to serve the rewritten resources; otherwise we run the risk of increasing latency to end users.

Encoding rewrite instructions in URL: Assume we are rewriting two CSS links, combining them into one link, and serving the combined file.
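One hypothetical shape for such an encoding is sketched below: join the origin paths with a separator and append an infix identifying the combining rewriter, so that any server receiving the request can decode the origin URLs and re-combine the files. Both the `+` separator and the `.pagespeed.cc.` infix here are placeholders for illustration, not the production encoding, and a real scheme would also need escaping and a content hash for cache validation.

```cpp
#include <string>
#include <vector>

// Illustrative encoding: pack the origin CSS paths into the rewritten
// URL itself, so the resource can be regenerated from the URL alone
// without shared state between servers. Placeholder format only.
std::string CombinedCssUrl(const std::vector<std::string>& css_paths) {
  std::string joined;
  for (const std::string& path : css_paths) {
    if (!joined.empty()) joined += "+";
    joined += path;
  }
  return joined + ".pagespeed.cc.css";
}
```

For example, two links to `a.css` and `b.css` would be replaced by a single link whose URL names both origins, and the server answering that request recombines the two files on demand (or serves a cached copy).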
