
A utility to archive webpages through time.

Takes snapshots and makes incremental backups of webpage assets so you can follow a page's evolution over time.

  • Assets are stored in a git repository to simplify incremental storage and retrieval

  • Snapshots and thumbnails are stored in a plain directory so they can easily be served by a webserver

  • The list of webpages and archive instances is stored in an SQL database

  • Some caching data is stored in the same database

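Required tools: PhantomJS and GraphicsMagick (the executables referenced by PHANTOMJS_PATH and GRAPHICS_MAGICK_PATH below)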

Installation

  • Install the required tools

  • Install the gem

  • All configuration items have default values; have a look below if you want to customize them (the default database configuration requires the sqlite3 gem)

  • Use it: all the required files and the database structure are created on the first call

API

The public API is provided by WebpageArchivist::WebpageArchivist, for example:

require 'webpage-archivist'
archivist = WebpageArchivist::WebpageArchivist.new
# Register a page (URL, display name), then fetch it by id
webpage = archivist.add_webpage('http://www.nytimes.com/', 'The New York Times')
archivist.fetch_webpages([webpage.id])
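
The same two calls can be wrapped to archive several pages at once. This is only a sketch: archive_all is a hypothetical helper that is not part of the gem, and it assumes nothing beyond what the example above shows, namely that add_webpage(url, name) returns an object with an id and that fetch_webpages accepts an array of ids.

```ruby
# Hypothetical helper, not part of the gem. `archivist` is expected to respond
# to add_webpage(url, name) and fetch_webpages(ids), as in the example above.
# `pages` is a Hash mapping each URL to its display name.
def archive_all(archivist, pages)
  # Register every page and collect the ids of the returned records.
  ids = pages.map { |url, name| archivist.add_webpage(url, name).id }
  # Fetch all registered pages in a single call.
  archivist.fetch_webpages(ids)
  ids
end

# Usage with the real gem (requires webpage-archivist to be installed):
#   require 'webpage-archivist'
#   archive_all(WebpageArchivist::WebpageArchivist.new,
#               { 'http://www.nytimes.com/' => 'The New York Times' })
```

Passing the archivist in as an argument keeps the helper easy to exercise with a stand-in object, without touching the network or the database.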

Models are available in the lib/webpage-archivist/models.rb file; have a look at the Sequel API if you want to query them.

Configuration

Basic configuration is done through environment variables:

  • DATABASE_URL : database URL, defaults to sqlite://#{Dir.pwd}/webpage-archivist.sqlite3. The syntax is the connection URL syntax used by Sequel; remember to add the required database gem

  • ARCHIVIST_ASSETS_PATH : path to store the assets, defaults to ./archivist_assets

  • ARCHIVIST_SNAPSHOTS_PATH : path to store the snapshots and thumbnails, defaults to ./archivist_snapshots

  • ARCHIVIST_MAX_RUNNING_REQUESTS : number of element requests running in parallel (not critical since requests are run using EventMachine), defaults to 20

  • PHANTOMJS_PATH : path to the PhantomJS executable if it isn't in the path

  • GRAPHICS_MAGICK_PATH : path to the GraphicsMagick executable if it isn't in the path

  • BACKGROUND_THREAD_POOL_SIZE : EventMachine pool size for background tasks like taking the snapshots (defaults to 20)
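
To make the variables and their defaults concrete, here is a minimal sketch of how they could be resolved. archivist_settings is a hypothetical helper, not the gem's own code; the hash keys are made up for illustration, and converting the two numeric values to integers is an assumption. Only the variable names and default values come from the list above.

```ruby
# Hypothetical helper, not part of the gem: resolves the documented settings
# from an environment-like hash (ENV by default), applying the documented defaults.
def archivist_settings(env = ENV)
  {
    database_url: env.fetch('DATABASE_URL') { "sqlite://#{Dir.pwd}/webpage-archivist.sqlite3" },
    assets_path: env.fetch('ARCHIVIST_ASSETS_PATH', './archivist_assets'),
    snapshots_path: env.fetch('ARCHIVIST_SNAPSHOTS_PATH', './archivist_snapshots'),
    # Assumption: numeric settings are parsed as integers.
    max_running_requests: Integer(env.fetch('ARCHIVIST_MAX_RUNNING_REQUESTS', '20')),
    background_thread_pool_size: Integer(env.fetch('BACKGROUND_THREAD_POOL_SIZE', '20')),
    # No documented default: nil here stands for "find the executable in the path".
    phantomjs_path: env['PHANTOMJS_PATH'],
    graphics_magick_path: env['GRAPHICS_MAGICK_PATH']
  }
end

# Override a single value and keep the defaults for everything else.
settings = archivist_settings('ARCHIVIST_MAX_RUNNING_REQUESTS' => '5')
puts settings[:max_running_requests]
puts settings[:assets_path]
```

In practice the variables would be set in the shell or in ENV before the gem is required, so that they are in place when the gem reads its configuration.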

Configuration for snapshotting is done through the WebpageArchivist::Snapshoter class.

To enable debugging, use:

WebpageArchivist.log = true

Connect to the database / run migrations

The database connection is available as WebpageArchivist::DATABASE. To run your own migrations, use:

require 'webpage-archivist/migrations'
WebpageArchivist::Migrations.migration 'create table foo' do
  WebpageArchivist::DATABASE.create_table :foos do
    primary_key :id
    # ...
  end
end

WebpageArchivist::Migrations.new.run

This way your migrations will be run when the corresponding class is loaded.

Released under the MIT license