html-archiver is a Python module for creating a self-contained HTML archive of a web page. It inlines everything on the page – CSS, JavaScript, images – so you have an entire copy of the page in a single file.
For a while I've used the page archiving tool on Pinboard, which takes a similar approach: inline CSS and JavaScript, grab a local copy of images, base64-encode certain elements. I wrote this because I wanted a couple of extra features:
- A tool that didn't rely on a third-party service
- Support for saving resources in
url()
values in CSS (e.g. webfonts) - A single-page output -- Pinboard saves images as separate files, and serves them alongside the HTML
It isn't perfect -- if you want a true archival copy, you'd be better off saving all the resources individually, and mimicing the web server layout -- but this is really aimed at making quick and cheap copies.
It can be invoked from the command line:
$ python3 html_archiver.py "http://example.org/foo/bar"
or from Python:
from html_archiver import HTMLArchiver
a = HTMLArchiver()
a.archive_url("http://example.org/foo/bar")
html-archiver attempts to produce a self-contained HTML archive. As well as downloading the HTML from a page, it:
Downloads any CSS and JavaScript dependencies. For example, the following:
<link rel="stylesheet" type="text/css" href="/css/style.css"> <script type="text/javascript" src="/js/app.js"></script>
would be replaced with:
<style> p { color: red; } ... </style> <script type="text/javascript"> function hello_world() { console.log("Hello world!"); } ... </script>
Replaces any images with base64-encoded data URIs. For example, the following:
<img src="/images/rainbow.jpg">
would be replaced with:
<img src="data:image/jpeg;base64,4QPaRXhp...">
Note: This only works for images supplied in the
src
attribute; thesrcset
attribute is not supported. See issue #1Any CSS rules that have a
url()
value are also replaced with base64-encoded data URIs.For pages that require login, you can pass a custom
Session
object with appropriate cookies toHTMLArchiver
, and it will use that session to download the page. For example:from archiver import HTMLArchiver from requests import Session sess = Session() sess.post('https://example.org/login', auth=('username', 'password')) archiver = HTMLArchiver(sess=sess) archiver.archive('https://example.org/logged_in_page')
Clone this repository and install dependencies with pip:
$ git clone https://github.com/alexwlchan/html-archiver.git
$ cd html-archiver
$ virtualenv env
$ source env/bin/activate
$ pip install -r requirements.txt
I develop and test against Python 2.7 and Python 3.3+.
If you find a bug, or a page that html-archiver misinterprets, please file an issue on the GitHub repo.
MIT.