Ruby gem to send URLs to Wayback Machine

WaybackArchiver

Post URLs to the Wayback Machine (Internet Archive) using a crawler, Sitemap(s), or a list of URLs.

The Wayback Machine is a digital archive of the World Wide Web [...] The service enables users to see archived versions of web pages across time ...
- Wikipedia

Installation

Install the gem:

$ gem install wayback_archiver

Or add this line to your application's Gemfile:

gem 'wayback_archiver'

And then execute:

$ bundle

Usage

Strategies:

  • auto (the default) - Will try to
    1. Find Sitemap(s) defined in /robots.txt
    2. Then look in common sitemap locations such as /sitemap-index.xml and /sitemap.xml
    3. Fall back to crawling (using the excellent spidr gem)
  • sitemap - Parse Sitemap(s), supports index files (and gzip)
  • crawl - Crawl the site and post the URLs found
  • url - Post a single URL
  • urls - Post URL(s)
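
The fallback order of the auto strategy can be sketched roughly as follows. This is only an illustration under stated assumptions, not the gem's implementation: `sitemaps_from_robots`, `pick_source`, and `COMMON_SITEMAP_PATHS` are hypothetical names, and the `exists` callable stands in for an HTTP check so the sketch runs without network access.

```ruby
# Hypothetical sketch of the "auto" fallback order; not the gem's actual code.

# Only the two locations named in the list above are assumed here.
COMMON_SITEMAP_PATHS = %w[/sitemap-index.xml /sitemap.xml].freeze

# Extract Sitemap URLs from the text of a robots.txt file.
def sitemaps_from_robots(robots_txt)
  robots_txt.each_line.map do |line|
    match = line.match(/\A\s*sitemap:\s*(\S+)/i)
    match && match[1]
  end.compact
end

# Decide where URLs come from, in the order described above.
# `exists` is any callable returning true when a path is reachable (assumed).
def pick_source(robots_txt, exists)
  from_robots = sitemaps_from_robots(robots_txt)
  return [:sitemap, from_robots] unless from_robots.empty?

  common = COMMON_SITEMAP_PATHS.select { |path| exists.call(path) }
  return [:sitemap, common] unless common.empty?

  [:crawl, []]
end

robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
p pick_source(robots, ->(_path) { false })
p pick_source('', ->(path) { path == '/sitemap.xml' })
p pick_source('', ->(_path) { false })
```

Passing the reachability check in as a callable keeps the decision logic testable on its own; a real implementation would issue HTTP requests instead.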

Ruby

First require the gem

require 'wayback_archiver'

Configuration (the below values are the defaults)

WaybackArchiver.concurrency = 5
WaybackArchiver.user_agent = WaybackArchiver::USER_AGENT
WaybackArchiver.logger = Logger.new(STDOUT)
WaybackArchiver.max_limit = -1 # unlimited
WaybackArchiver.adapter = WaybackArchiver::WaybackMachine # must implement #call(url)

For a more verbose log you can configure WaybackArchiver as such:

WaybackArchiver.logger = Logger.new(STDOUT).tap do |logger|
  logger.progname = 'WaybackArchiver'
  logger.level = Logger::DEBUG
end

Pro tip: If you're using the gem in a Rails app you can set WaybackArchiver.logger = Rails.logger.

Examples:

Auto

# auto is the default
WaybackArchiver.archive('example.com')

# or explicitly
WaybackArchiver.archive('example.com', strategy: :auto)

Crawl

WaybackArchiver.archive('example.com', strategy: :crawl)

Only send one single URL

WaybackArchiver.archive('example.com', strategy: :url)

Send multiple URLs

WaybackArchiver.archive(%w[example.com www.example.com], strategy: :urls)

Send all URL(s) found in Sitemap

WaybackArchiver.archive('example.com/sitemap.xml', strategy: :sitemap)

# works with Sitemap index files too
WaybackArchiver.archive('example.com/sitemap-index.xml.gz', strategy: :sitemap)
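
To illustrate what handling a gzipped Sitemap index involves, here is a minimal stdlib-only sketch. It is not how the gem parses Sitemaps: `maybe_gunzip` and `sitemap_locations` are hypothetical helpers, and a regex stands in for a proper XML parser to keep the example dependency-free.

```ruby
require 'stringio'
require 'zlib'

# Hypothetical helper: gunzip the payload if it starts with the gzip magic bytes.
def maybe_gunzip(data)
  gzipped = data.getbyte(0) == 0x1f && data.getbyte(1) == 0x8b
  return data unless gzipped

  Zlib::GzipReader.new(StringIO.new(data)).read
end

# Hypothetical helper: collect every <loc> value from a sitemap or sitemap index.
# A regex keeps the sketch small; real code should use an XML library.
def sitemap_locations(data)
  xml = maybe_gunzip(data)
  kind = xml.include?('<sitemapindex') ? :index : :urlset
  [kind, xml.scan(%r{<loc>\s*(.*?)\s*</loc>}m).flatten]
end

index_xml = <<~XML
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
    <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  </sitemapindex>
XML

# Compress the sample in memory to mimic a .xml.gz file.
buffer = StringIO.new(String.new)
gz = Zlib::GzipWriter.new(buffer)
gz.write(index_xml)
gz.finish
compressed = buffer.string

p sitemap_locations(compressed)
p sitemap_locations(index_xml)
```

For an index file, each returned location is itself a Sitemap that would need to be fetched and parsed in turn.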

Specify concurrency

WaybackArchiver.archive('example.com', strategy: :auto, concurrency: 10)
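
To illustrate what a concurrency setting of this kind typically controls, here is a minimal worker-pool sketch using only Ruby's standard library. It shows the general technique, not the gem's internals; `process_concurrently` is a hypothetical name.

```ruby
# Hypothetical bounded worker pool; illustrates the idea of a concurrency limit.
def process_concurrently(items, concurrency:)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # once drained, #pop returns nil and the workers stop

  results = Queue.new
  workers = Array.new(concurrency) do
    Thread.new do
      while (item = queue.pop)
        results << yield(item)
      end
    end
  end
  workers.each(&:join)

  # Completion order is not deterministic, so callers should not rely on it.
  Array.new(results.size) { results.pop }
end

urls = %w[example.com/a example.com/b example.com/c]
posted = process_concurrently(urls, concurrency: 2) { |url| "posted #{url}" }
puts posted.sort
```

At most `concurrency` items are in flight at any moment, which is the property a setting like `concurrency: 10` is meant to bound.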

Specify max number of URLs to be archived

WaybackArchiver.archive('example.com', strategy: :auto, limit: 10)

Each archive strategy can receive a block that will be called for each URL

WaybackArchiver.archive('example.com', strategy: :auto) do |result|
  if result.success?
    puts "Successfully archived: #{result.archived_url}"
  else
    puts "Error (HTTP #{result.code}) when archiving: #{result.archived_url}"
  end
end
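
A block like the one above could also collect results for a summary at the end. The sketch below uses a stand-in `FakeResult` struct so that it runs without the gem or network access. It assumes only the methods used above (`success?`, `code`, `archived_url`); `FakeResult` and `summarize` are hypothetical names, and the sample URLs and codes are made up for illustration.

```ruby
# Stand-in for the result object; only the methods shown above are assumed.
FakeResult = Struct.new(:archived_url, :code, :ok) do
  def success?
    ok
  end
end

# Split results into archived URLs and [url, code] pairs for failures.
def summarize(results)
  succeeded, failed = results.partition(&:success?)
  {
    archived: succeeded.map(&:archived_url),
    failed: failed.map { |result| [result.archived_url, result.code] }
  }
end

collected = [
  FakeResult.new('example.com', '200', true),
  FakeResult.new('example.com/missing', '404', false)
]

summary = summarize(collected)
puts "Archived: #{summary[:archived].size}, failed: #{summary[:failed].size}"
```

In real use, `collected` would be filled by appending each `result` inside the block passed to `WaybackArchiver.archive`.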

Use your own adapter for posting found URLs

WaybackArchiver.adapter = ->(url) { puts url } # anything that responds to #call
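
Since an adapter only needs to respond to #call(url), a plain object works as well as a lambda. Here is a minimal sketch of an adapter that records URLs instead of posting them, which can be handy for dry runs. `RecordingAdapter` is a hypothetical class, and returning the URL from `#call` is an arbitrary choice here, since the return value the gem expects is not covered above.

```ruby
# Hypothetical adapter: records URLs instead of posting them anywhere.
class RecordingAdapter
  attr_reader :urls

  def initialize
    @urls = []
    @lock = Mutex.new # guards @urls in case #call runs from several threads
  end

  # The adapter contract from the configuration section: respond to #call(url).
  def call(url)
    @lock.synchronize { @urls << url }
    url
  end
end

adapter = RecordingAdapter.new
%w[example.com example.com/about].each { |url| adapter.call(url) }
puts adapter.urls

# With the gem loaded, the adapter would be assigned like this:
# WaybackArchiver.adapter = adapter
```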

ℹ️ This gem uses the spidr gem. The version pushed to RubyGems has a bug that is fixed in the master branch. To use the fixed version, add gem 'spidr', github: 'postmodern/spidr' to your Gemfile. See #25 for details.

CLI

Usage:

wayback_archiver [<url>] [options]

Print full usage instructions

wayback_archiver --help

Examples:

Auto

# auto is the default
wayback_archiver example.com

# or explicitly
wayback_archiver example.com --auto

Crawl

wayback_archiver example.com --crawl

Only send one single URL

wayback_archiver example.com --url

Send multiple URLs

wayback_archiver example.com www.example.com --urls

Crawl multiple URLs

wayback_archiver example.com www.example.com --crawl

Send all URL(s) found in Sitemap

wayback_archiver example.com/sitemap.xml

# works with Sitemap index files too
wayback_archiver example.com/sitemap-index.xml.gz

Most options

wayback_archiver example.com www.example.com --auto --concurrency=10 --limit=100 --log=output.log --verbose

View archive: https://web.archive.org/web/*/http://example.com (replace http://example.com with your desired domain).

Docs

You can find the docs online on RubyDoc.

This gem is documented using yard. To generate the docs locally, run the following from the root of this repository:

yard # Generates documentation to doc/

Contributing

Contributions, feedback and suggestions are very welcome.

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create a new Pull Request

License

MIT License
