A utility to archive webpages through time



Takes snapshots and makes incremental backups of webpage assets so you can follow each page's evolution through time.

  • Assets are stored in a git repository to simplify incremental storage and retrieval

  • Snapshots and thumbnails are stored in a plain directory so they can easily be served by a webserver

  • The list of webpages and archive instances is stored in an SQL database

  • Some caching data is stored in the same database
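The idea behind incremental storage can be sketched in a few lines of Ruby. This is not the gem's code: the gem delegates versioning to git, while the hypothetical `IncrementalStoreSketch` class below keeps everything in memory and only illustrates the principle of recording a new version when the content digest changes.

```ruby
require 'digest'

# Hypothetical in-memory store (not part of webpage-archivist):
# a new version of an asset is kept only when its content changed.
class IncrementalStoreSketch
  def initialize
    # url => list of { digest:, content: } entries, oldest first
    @versions = Hash.new { |hash, url| hash[url] = [] }
  end

  # Returns true when a new version was recorded,
  # false when the content is identical to the last stored version.
  def record(url, content)
    digest = Digest::SHA256.hexdigest(content)
    history = @versions[url]
    return false if !history.empty? && history.last[:digest] == digest

    history << { digest: digest, content: content }
    true
  end

  # Digests of all stored versions of an asset, oldest first.
  def digests(url)
    @versions[url].map { |version| version[:digest] }
  end
end
```

Fetching the same unchanged stylesheet twice would store it once; a later fetch with different content adds a second version.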

Required tools:


  • Install the required tools

  • Install the gem

  • All configuration items have default values; have a look below if you want to customize them (the default database configuration requires the sqlite3 gem)

  • Use it! All the required files and the database structure are created on the first call


The public API is provided by WebpageArchivist::WebpageArchivist, example:

require 'webpage-archivist'
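# Instantiate the public API class named above
archivist = WebpageArchivist::WebpageArchivist.new
# Arguments: the page URL (left empty in this example), then its name
webpage = archivist.add_webpage('', 'The New York Times')
archivist.fetch_webpages []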

Models are available in the lib/webpage-archivist/models.rb file; have a look at the Sequel API if you want to query them.


Basic configuration is done through environment variables:

  • DATABASE_URL: database URL, defaults to sqlite://#{Dir.pwd}/webpage-archivist.sqlite3. The syntax follows Sequel's connection strings; remember to add the required database gem

  • ARCHIVIST_ASSETS_PATH: path to store the assets, defaults to ./archivist_assets

  • ARCHIVIST_SNAPSHOTS_PATH: path to store the snapshots and thumbnails, defaults to ./archivist_snapshots

  • ARCHIVIST_MAX_RUNNING_REQUESTS: number of element requests running in parallel (not critical, since requests are run using EventMachine), defaults to 20

  • PHANTOMJS_PATH: path to the PhantomJS executable if it isn't in the path

  • GRAPHICS_MAGICK_PATH: path to the GraphicsMagick executable if it isn't in the path

  • BACKGROUND_THREAD_POOL_SIZE: EventMachine pool size for background tasks like taking the snapshots, defaults to 20
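To check which values a given environment would resolve to, a small helper along these lines may be handy. It is only a sketch: `ArchivistConfigSketch` and `resolve` are hypothetical names, not part of the gem, and the defaults are copied from the list above. The two executable paths are left out because they have no default value.

```ruby
# Hypothetical helper (not part of webpage-archivist): mirrors the
# documented defaults so the effective settings can be inspected.
module ArchivistConfigSketch
  DEFAULTS = {
    'DATABASE_URL' => "sqlite://#{Dir.pwd}/webpage-archivist.sqlite3",
    'ARCHIVIST_ASSETS_PATH' => './archivist_assets',
    'ARCHIVIST_SNAPSHOTS_PATH' => './archivist_snapshots',
    'ARCHIVIST_MAX_RUNNING_REQUESTS' => '20',
    'BACKGROUND_THREAD_POOL_SIZE' => '20'
  }.freeze

  # Returns a Hash of variable name => effective value.
  # `env` can be ENV or any Hash-like object responding to #fetch.
  def self.resolve(env = ENV)
    DEFAULTS.each_with_object({}) do |(name, default), settings|
      settings[name] = env.fetch(name, default)
    end
  end
end
```

Calling `ArchivistConfigSketch.resolve` with no argument reads the real environment; passing a Hash makes it easy to try out overrides. Since the gem reads these variables itself, they should be set before `require 'webpage-archivist'`.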

Snapshotting is configured through the WebpageArchivist::Snapshoter class.

To enable debugging, use:

WebpageArchivist.log = true

Connect to the database / run migrations

The database connection is available as WebpageArchivist::DATABASE. If you want to run your own migrations, use:

require 'webpage-archivist/migrations'
WebpageArchivist::Migrations.migration 'create table foo' do
  WebpageArchivist::DATABASE.create_table :foos do
    primary_key :id
    # ...
  end
end

This way, your migrations are run when the corresponding class is loaded.
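The "named block that runs once" pattern can be illustrated with a tiny stand-in. This is a sketch under stated assumptions, not the gem's implementation: `MigrationSketch` is a hypothetical module, and it remembers applied names in memory only, whereas how the gem itself tracks applied migrations is not described here.

```ruby
# Hypothetical stand-in (not part of webpage-archivist) for a
# registry of named migrations that each run at most once.
module MigrationSketch
  @applied = []

  class << self
    # Names of the migrations that have already run, in order.
    attr_reader :applied

    # Runs the block once per name; returns true if it ran,
    # false if a migration with that name was already applied.
    def migration(name)
      return false if @applied.include?(name)

      yield
      @applied << name
      true
    end
  end
end
```

Because the call sits at the top level of a file, simply loading that file triggers the migration, and loading it again does nothing.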

Released under the MIT license