Skip to content

dblock/ruby-link-checker

Repository files navigation

Fast Ruby Link Checker

Gem Version Build Status Code Climate Test Coverage

A fast Ruby link checker with support for multiple HTTP libraries. Does not parse documents, just checks links. Fast. Anecdotal benchmarking on a M1 mac and T1 Internet yields ~50 URLs per second with LinkChecker::Typhoeus::Hydra.

Table of Contents

Usage

Dependencies

The LinkChecker::Typhoeus::Hydra link checker is recommended.

Add typhoeus and ruby-link-checker to your Gemfile and run bundle install.

gem 'typhoeus'
gem 'ruby-link-checker'

Basic Usage

require 'typhoeus'
require 'ruby-link-checker'

# create a new checker instance
checker = LinkChecker::Typhoeus::Hydra::Checker.new

# queue URLs to check
links = [...]
links.each do |url|
  checker.check url
end

# run the checks
checker.run

# display buckets of results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end

Passing Options

You can pipe custom options through check and retrieve them in events as follows.

checker.check 'https://www.example.org', { location: 'page.html' }

checker.on :success do |result|
  result.options # contains { location: 'page.html' }
end

Checkers

Fast link checker that uses Typhoeus.

require 'typhoeus'
require 'ruby-link-checker'

# create a new instance of a checker
checker = LinkChecker::Typhoeus::Hydra::Checker.new(
  hydra: {
    # lower than the Typhoeus default of 200, seems to start breaking around 50+
    max_concurrency: 25
  }
)

# log requests and response codes
checker.logger.level = Logger::INFO

links = [...] # array of URLs
links.each do |url|
  checker.check url
end

# examine failures and errors as they come
checker.on :error, :failure do |result|
  puts "FAIL: #{result}"
end    

# execute Hydra#run, will block until all requests have completed
checker.run

# examine results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end

You can pass Typhoeus timeout options into a new instance of a checker, or configure timeouts globally.

LinkChecker::Typhoeus::Hydra.configure do |config|
  config.timeout = 5
  config.connecttimeout = 10
end

Slow, sequential checker.

require 'net/http'
require 'ruby-link-checker'

# create a new instance of a checker
checker = LinkChecker::Net::HTTP::Checker.new

# log requests and response codes
checker.logger.level = Logger::INFO

links = [...] # array of URLs
links.each do |url|
  checker.check url
end

# examine results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end

You can pass Net::HTTP timeout options into a new instance of a checker, or configure timeouts globally.

LinkChecker::Net::HTTP.configure do |config|
  config.read_timeout = 5
  config.open_timeout = 10
end

Options

Retries

By default link checkers do not retry. You can set a number of times to retry all errors and failures with retries.

checker = LinkChecker::Net::HTTP::Checker.new(retry: 1)

Results

By default checkers collect results.

checker = LinkChecker::Net::HTTP::Checker.new(results: false)
...
checker.run

checker.results # => { error: [...], failure: [...], success: [...] }

You can disable this with results: false.

checker = LinkChecker::Net::HTTP::Checker.new(results: false)
...
checker.run

checker.results # => nil

Methods

By default checkers try a HEAD request, followed by a GET if HEAD fails. You can change this behavior by specifying other methods.

The following examples disables GET and only makes HEAD requests.

checker = LinkChecker::Net::HTTP::Checker.new(methods: %w[HEAD])

Logger

Pass your own logger.

checker = LinkChecker::Net::HTTP::Checker.new(logger: Logger.new(STDOUT))

User-Agent

Pass your own user-agent. Default is Ruby Link Checker/x.y.z.

checker = LinkChecker::Net::HTTP::Checker.new(user_agent: 'Custom Agent/1.0')

Global Configuration

All options can also be configured globally.

LinkChecker.configure do |config|
  config.user_agent = 'Custom Agent/1.0'
  config.methods = ['HEAD', 'GET']
  config.logger = ::Logger.new(STDOUT)
end

Callbacks and Events

Events enable processing of results as they become available.

checker.on :result do |result|
  puts result # any result
end

checker.on :error, :failure do |result|
  puts result # error or failure
end

Checkers support the following events.

Event Description
:retry A request is being retried on failure or error.
:result A new result, any of success, failure, or error.
:success A valid URL, usually a 2xx response from the server.
:failure A failed URL, usually a 4xx or a 5xx response from the server.
:error An error, such as an invalid URL or a network timeout.

Events are called with results, which contain the following properties.

Property Description
:url The original URL before redirects.
:result_url The last URL, different from url in case of redirects.
:method The result HTTP method.
:code HTTP error code.
:request_headers Request headers.
:redirect_to A redirect URL in case of redirects.
:error A raised error in case of errors.

See result.rb for more details.

Contributing

You're encouraged to contribute to ruby-link-checker. See CONTRIBUTING for details.

Copyright and License

Copyright (c) Daniel Doubrovkine and Contributors.

This project is licensed under the MIT License.

About

Fast ruby link checker.

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages