'Freeze-dry' web page archiving #78
People should be able to keep web pages they have visited. The usual practice of just keeping the URL of the page (e.g. when bookmarking) is problematic. Pages are served from a single remote place and carry no versioning information, so one has to rely on that single authority to keep a document available. Browsers have always had a 'save page as..' option, but it has had too little love. There are different ways to approach web page archival (e.g. recording all transactions), but a simple one, which can be achieved in a browser extension, is to save the rendered DOM as carefully as possible.
This issue is about making a script that takes a 'live' web page, i.e. a page that is currently displayed in a tab in a browser, and converts it to a single, static html file without external dependencies ('freeze-drying' it, though I'm open to other name suggestions). Opening the page in a normal browser should not trigger any connections to the outside world, while displaying the page as accurately as possible. Some things it should take care of:
In the process, metadata should be added to each inlined or rewritten element to keep a reference to the origins. For example, the original
The script could be created as a separate module, in a separate repo, to make it usable in other contexts. Assuming it would not require any APIs specific to browser extensions, it could be run on other platforms or be added to pages so that they can archive themselves. I suppose the module could provide a single function
There are at least two notable browser extensions that perform a similar archiving procedure, so some tricks may be borrowed from them: SingleFile (see especially
Could it be that SingleFile is not open source? At least they share no license.
I also found the MHTML format: https://en.wikipedia.org/wiki/MHTML
Mozilla also has a project/addon:
It is GPLv3 licensed.
I forgot about MHTML altogether; I like how it allows bundling multiple files while giving space for their original URLs and other metadata. It seems badly supported, however, and without support it cannot be interpreted as a normal html file (it is formatted as an email!), so using
Oh :) not where I expected that file.
What a pity. Thanks for explaining.
Hi @Treora, glad to see this project taking off! We chatted not too long ago about possible collaboration.
I just wanted to throw in a few suggestions/thoughts about formats. In the web archiving world and with the Webrecorder project, the WARC format is pretty standard. There is also the HAR format (http://www.softwareishard.com/blog/har-12-spec/), which is supported by at least Chrome and Firefox. You may be able to access these HAR export tools from browser extension APIs, e.g.: https://developer.chrome.com/extensions/devtools_network
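To make the HAR idea a bit more concrete, here is a minimal sketch of how the entries of a HAR 1.2 log could be turned into a map from request URL to `data:` URL, which a single-file snapshot could then draw on. The function name `harToDataUrls` and the decision to keep only `200` responses are assumptions made for illustration; only the field names (`log.entries`, `request.url`, `response.content.mimeType`, `text`, `encoding`) come from the HAR format itself.

```typescript
// Minimal subset of the HAR 1.2 structure that this sketch relies on.
interface HarContent {
  mimeType: string;
  text?: string;     // response body, if the exporter included it
  encoding?: string; // "base64" when `text` is base64-encoded
}
interface HarEntry {
  request: { url: string };
  response: { status: number; content: HarContent };
}
interface Har {
  log: { entries: HarEntry[] };
}

// Hypothetical helper (not part of any library): map each recorded URL to a data URL.
function harToDataUrls(har: Har): Map<string, string> {
  const resources = new Map<string, string>();
  for (const entry of har.log.entries) {
    const { status, content } = entry.response;
    // Assumption: only successful responses with a recorded body are useful.
    if (status !== 200 || content.text === undefined) continue;
    // Strip whitespace such as the space in "text/css; charset=utf-8".
    const mime = content.mimeType.replace(/\s+/g, "");
    const dataUrl =
      content.encoding === "base64"
        ? `data:${mime};base64,${content.text}`
        : `data:${mime},${encodeURIComponent(content.text)}`;
    resources.set(entry.request.url, dataUrl);
  }
  return resources;
}
```

This ignores redirects, repeated requests for the same URL, and character-set details, so it is a starting point rather than a faithful replay mechanism.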
These formats are designed for storing the transactions (and there is now a tool for converting HAR->WARC). In Webrecorder, we also have a "Static Snapshot" option which tries to save the current state of the DOM and put it back into a WARC file. This turned out to be more tricky than it seemed at first, especially due to
Since Webrecorder already has the images and external links saved through the transactional recording, we need not worry about those, but we do remove all the
Of course, this approach will likely break video or other objects, or tags that refer to local storage, e.g.
I guess my question would be: are you trying to produce a single HTML file, or do you want to use an existing format that contains HTML? If the latter, then I think it really makes sense to work with an existing format to maintain interoperability.
Saving and replaying transactions like you do with WARC (or HAR) is indeed the right way to archive content as truthfully as possible. However, for a few reasons my idea is to now take a different approach, and save the rendered DOM instead.
So, to answer your question, I think I am trying to produce a single HTML file, and one reason is that it is an existing format that maintains interoperability: however, not interoperability with archiving tools, but with file managers, document viewers and editors. It would be nice to have both, but I do not see a pragmatic way to do this now. I'd be glad to hear your thoughts. A future idea could be to also create WARCs, for genuine archiving, but that would be a separate endeavour.
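As a rough illustration of the single-file approach, with the origin metadata proposed in the issue description, the sketch below inlines one URL-valued attribute and records its original value next to it. Everything named here is hypothetical: the helper `inlineAttribute`, the `data-original-*` attribute naming, and the `resources` map (absolute URL to `data:` URL) are assumptions for illustration, not an agreed design. `ElementLike` is just the part of the DOM `Element` interface the sketch needs, so it can also run outside a browser.

```typescript
// Structural subset of the DOM Element interface used by this sketch.
interface ElementLike {
  getAttribute(name: string): string | null;
  setAttribute(name: string, value: string): void;
}

// Hypothetical helper: replace a URL-valued attribute with a data URL and
// keep the original value in a data-original-<attr> attribute.
// Returns true if the element was rewritten.
function inlineAttribute(
  el: ElementLike,
  attr: string,
  resources: Map<string, string>,  // absolute URL -> data URL
  resolve: (url: string) => string // turns a possibly relative URL into an absolute one
): boolean {
  const original = el.getAttribute(attr);
  // Nothing to do if the attribute is absent or already inlined.
  if (original === null || original.startsWith("data:")) return false;
  const dataUrl = resources.get(resolve(original));
  if (dataUrl === undefined) return false; // resource was not captured; leave as-is
  el.setAttribute(`data-original-${attr}`, original);
  el.setAttribute(attr, dataUrl);
  return true;
}
```

In a page, this might be called for every `img[src]` with `resolve` set to something like `(u) => new URL(u, document.baseURI).href`. Stylesheets, `srcset`, frames and scripts would each need their own handling, which this sketch does not attempt.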
I forgot that Webrecorder also supports making a static snapshot. Is processing the DOM (e.g. removing the scripts) done on the Python side? (At least it seems not to happen in the page itself.) Also thanks for the iframe