A website crawler / validator, designed to be integrated into an automated website integrity test. Inspired by the Python "Webcheck" project (http://arthurdejong.org/webcheck/)
Ruby
Switch branches/tags
Nothing to show
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
test
.gitignore
LICENSE
README.rdoc
TODO
consistencychecker.rb
linkfinder.rb
webcheck.rb
webcrawler.rb

README.rdoc

Webcheck-ruby

Webcheck-ruby is a simple web site crawler / checker, inspired by the Python program Webcheck, currently maintained by Arthur de Jong

Features

  • fully automated

  • identifies broken links, missing images, missing linked resources, and missing scripts

  • modular design allows for substitution of HTML link extractor, crawler, and page-validation rules

Usage

To crawl a web page:

require 'webcheck'
checker = Webcheck.new
results = checker.check("http://example.com")
print results[:uris404]
print results[:urisUnknown]
checker.prettyPrint(results)

The default format of the results is a hash with the following keys

uris404

Links that returned a 404 error

uris200

Links that were followed successfully

urisUnknown

Links that returned an unknown HTTP code. These are in the form of hashes with the keys :code,:uri

checked

A hash of all the URIs checked. The keys are the URIs

urisNonHTTP

Links on the page that led to resources not accessed via the HTTP protocol (such as e-mail and FTP)

Components

Webcrawler

A recursive crawling tool. At each step, it retrieves a resource from a URL and feeds it to a block. The block yields links to retrieve on subsequent steps.

ConsistencyChecker

The validation engine. Takes the HTML bodies retrieved by the crawler and determines how to handle each one.

Linkfinder

A link extraction tool, which takes HTML bodies and extracts URLs from them.

Dependencies

The Linkfinder uses Nokogiri to extract links from web pages.

Author

Copyright © 2010 by Mark T. Tomczak, released under the MIT license.