'Freeze-dry' web page archiving #78

@Treora

Background

People should be able to keep the web pages they have visited. The usual practice of keeping only the URL of a page (e.g. when bookmarking) is problematic: a page is served from a single remote place and carries no versioning information, so one has to rely on that single authority to keep the document available. Browsers have always had a 'save page as...' option, but it has had too little love. There are different ways to approach web page archival (e.g. recording all transactions), but a simple one, which can be achieved in a browser extension, is to save the rendered DOM as carefully as possible.

Scope

This issue is about making a script that takes a 'live' web page, i.e. a page that is currently displayed in a browser tab, and converts it to a single, static html file without external dependencies ('freeze-drying' it, though I'm open to other name suggestions). Opening the resulting file in a normal browser should not trigger any connections to the outside world, and should display the page as accurately as possible. Some things it should take care of:

  • Images are to be inlined as data: URIs.
  • External stylesheets can be collected and nested in <style> elements.
  • Scripts, including event handlers in element attributes, have to be removed. (If somebody figures out a way to allow simple and safe scripts, that would be great, but it seems far out of scope.)
  • Other embeds, objects, and some types of links may have to be removed or rewritten.
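To make the first and third points a bit more concrete, here is a minimal sketch. All helper names (toDataUri, isEventHandlerAttribute, removeScripts) are made up for illustration, and treating every attribute whose name starts with "on" as an event handler is a crude assumption. It also ignores other script carriers such as javascript: links, so it is nowhere near a complete sanitiser.

```javascript
// Sketch only: these helper names are hypothetical, not from an existing library.

// Encode raw bytes (e.g. the body of a fetched image) as a base64 data: URI.
function toDataUri(bytes, mimeType) {
  let binary = '';
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return `data:${mimeType};base64,${btoa(binary)}`;
}

// Assumption: any attribute whose name starts with "on" (onclick, onload, ...)
// is treated as an inline event handler.
function isEventHandlerAttribute(name) {
  return /^on/i.test(name);
}

// Remove <script> elements and inline event handlers from a document, in place.
function removeScripts(doc) {
  for (const script of Array.from(doc.querySelectorAll('script'))) {
    script.remove();
  }
  for (const element of Array.from(doc.querySelectorAll('*'))) {
    for (const attribute of Array.from(element.attributes)) {
      if (isEventHandlerAttribute(attribute.name)) {
        element.removeAttribute(attribute.name);
      }
    }
  }
}
```

In practice removeScripts would presumably run on a clone of the live document, so that the page the user is looking at keeps working.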

In the process, metadata should be added to each inlined or rewritten element to keep a reference to its origin. For example, the original src attribute of an img tag would be moved to another attribute, perhaps using RDFa and a standard vocabulary: <img src="data:image/png,....." rel="dc:source" resource="http://example.org/original_location.png" />. The exact choice of what to store and how could be decided or improved later (at the least, we should also record the date of retrieval). The document as a whole should get appropriate metadata in the same manner, to inform about its origin.
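A rough sketch of what that rewriting step could look like for a single image. The function name inlineWithProvenance and the data-retrieved attribute are placeholders of mine, since the vocabulary for the retrieval date is still undecided. The rel and resource attributes follow the example above.

```javascript
// Sketch only: the function name and the data-retrieved attribute are hypothetical.

// Replace an image's src with a data: URI while keeping a pointer to where it
// came from. Works on anything with getAttribute/setAttribute.
function inlineWithProvenance(img, dataUri, baseUrl, retrievedAt) {
  // Resolve relative URLs so the recorded origin still makes sense
  // once the file is detached from its original location.
  const originalUrl = new URL(img.getAttribute('src'), baseUrl).href;

  img.setAttribute('src', dataUri);
  img.setAttribute('rel', 'dc:source');
  img.setAttribute('resource', originalUrl);
  // Placeholder for the date of retrieval, stored as an ISO 8601 timestamp.
  img.setAttribute('data-retrieved', retrievedAt.toISOString());
}
```

It would be called once per image, for example as inlineWithProvenance(img, dataUri, document.baseURI, new Date()), after the image bytes have been fetched and encoded.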

The script could be created as a separate module, in a separate repo, to make it usable in other contexts. Assuming it would not require any APIs specific to browser extensions, it could be run on other platforms, or be added to pages so that they can archive themselves. I suppose the module could provide a single function freezeDry(dom=window.document, options={}) that returns the new html file (as a string/Blob/DOM object?).
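Just to illustrate the shape of that function, a skeleton could look like the following. The steps option and the choice to return a string are assumptions for the sake of the example, not a settled design. It is async because inlining will need to fetch resources.

```javascript
// Sketch only: the `steps` option and the string return type are assumptions.

// Clone the document, run each processing step on the clone, and serialise it.
// Each step is a (possibly async) function receiving the cloned root element
// and the original document.
async function freezeDry(dom = window.document, options = {}) {
  const { steps = [] } = options;

  // Work on a deep copy so the live page is left untouched.
  const root = dom.documentElement.cloneNode(true);

  for (const step of steps) {
    await step(root, dom);
  }

  // The doctype is not part of documentElement and is left out here for brevity.
  return root.outerHTML;
}
```

The inlining, script removal and metadata steps would then each be a separate function passed in (or built in), which keeps them testable on their own.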

Prior art

There are at least two notable browser extensions that perform a similar archiving procedure, so some tricks may be borrowed from them: SingleFile (see especially docprocessor.js) and Scrapbook (I found no repo of it online, so download and unzip the addon to get its source code, then look at chrome/content/scrapbook/saver.js). If anyone knows of more noteworthy or (ideally) reusable code, let me know.
