consistencychecker.rb
# Basic consistency checker (finds 404s and 200s)
#
# Designed to serve as a control block for a Webcrawler, the ConsistencyChecker
# maintains state on which pages have been seen and the results of attempting
# to retrieve them. It generates a report that can be used to quickly discern
# the health of a website.
#
# Features of behavior:
# * logs 404s, 200s, and other HTTP results
# * logs, but does not crawl, non-HTTP hyperlinks
# * uses the provided linkfinder to parse each HTML page's body for more
#   links to follow
# * tracks pages that were already crawled; does not crawl a page more than
#   once
# * restricts crawling to the initial domain; outbound links are retrieved
#   to verify they are not 404s, but are not crawled
#
# ConsistencyChecker produces an output report in the results property.
# The report is a hash with specific keys; see the webcrawler-ruby README for
# more information.
#
# == Usage
#
# Usually, ConsistencyChecker is used as the control block for a Webcrawler;
# see the documentation of Webcheck for more information.
#
# You may also use it in a standalone fashion:
#
#   require 'linkfinder'
#   require 'uri'
#
#   checker = ConsistencyChecker.new(Linkfinder.new, URI("http://example.com"))
#   pageToCheck = URI("http://example.com/index.htm")
#
#   myHTTPResponse = getAPage(pageToCheck)
#   linksOnIndexPage = checker.check(pageToCheck, myHTTPResponse)
#   results = checker.results
#   assert results[:uris200].include? pageToCheck
#   assert results[:checked].include? pageToCheck
#   # If there were mailto links, they would show up in results[:urisNonHTTP]
#   # linksOnIndexPage is an array of links retrieved from index.htm using
#   # the specified linkfinder
class ConsistencyChecker
  attr_reader :results

  # Initialize a new ConsistencyChecker
  #
  # [linkfinder] The Linkfinder to use. See linkfinder.rb for more information
  # [baseURI] An initial URI, which specifies the domain of the check. The
  #           ConsistencyChecker will not crawl resources outside of this
  #           domain. For convenience, you may wish to specify the URI of
  #           the first page that will be checked.
  def initialize(linkfinder,baseURI)
    @results={
      :uris404=>[],
      :uris200=>[],
      :urisUnknown=>[],
      :checked=>{},
      :urisNonHTTP=>[]
    }
    @linkfinder=linkfinder
    @baseURI=baseURI
  end
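
  # Check the page and update the results with information retrieved from
  # the page
  #
  # [uri] The URI of the page we are checking
  # [res] A net/HTTP response object generated by requesting the specified URI
  #
  # return:: list of URIs to check next
  def check(uri,res)
    @results[:checked][uri]=true
    if res.code=="404"
      @results[:uris404] << uri
      return []
    elsif res.code=="200"
      @results[:uris200] << uri
      # Compare only the media type: the header may carry parameters such as
      # "; charset=utf-8", and may be missing altogether.
      contentType=res['content-type'].to_s.split(';').first.to_s.strip.downcase
      # Pages outside the base domain and non-HTML resources are not parsed.
      if uri.host != @baseURI.host || contentType != 'text/html'
        return []
      end
      links=@linkfinder.getLinks(res.body)
      links=@linkfinder.convertToAbsolute(links,uri)
      returnedLinks=[]
      links.each {|link|
        # Only "http" links are followed; every other scheme (mailto, ftp,
        # and also https) is logged as non-HTTP.
        if link.scheme != "http"
          @results[:urisNonHTTP] << link
        elsif !@results[:checked].include?(link)
          # Mark the link now so it is never handed out twice.
          @results[:checked][link]=true
          returnedLinks << link
        end
      }
      return returnedLinks
    else
      # Any other status code is recorded together with its URI.
      @results[:urisUnknown] << {:code => res.code, :uri => uri}
    end
    return []
  end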
end
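
# == Sketch: driving the checker without a network
#
# The class above leaves fetching and link extraction to its caller, so it can
# be exercised with stand-ins. The block below is a minimal sketch under stated
# assumptions, not the project's own Linkfinder or Webcrawler: SketchLinkfinder,
# SketchResponse, crawl, and the fetch callable are all hypothetical names. They
# mirror only what check is seen to call: getLinks(body),
# convertToAbsolute(links, uri), res.code, res.body, and res['content-type'].

```ruby
require 'uri'

# Hypothetical stand-in for the linkfinder interface that check relies on.
# The real Linkfinder in linkfinder.rb may behave differently.
class SketchLinkfinder
  # Pull href values out of anchor tags with a simple regular expression.
  def getLinks(body)
    body.to_s.scan(/<a\s[^>]*href\s*=\s*["']([^"']+)["']/i).flatten
  end

  # Resolve each link against the URI of the page it was found on.
  def convertToAbsolute(links, base)
    links.map { |link| URI.join(base.to_s, link) }
  end
end

# Hypothetical canned response exposing the three things check reads.
class SketchResponse
  attr_reader :code, :body

  def initialize(code, body = "", headers = {})
    @code = code
    @body = body
    # Store header names in lower case so lookups are case-insensitive.
    @headers = {}
    headers.each { |name, value| @headers[name.downcase] = value }
  end

  def [](name)
    @headers[name.downcase]
  end
end

# Breadth-first driver: hand each queued URI and its response to the checker,
# then queue whatever the checker returns. De-duplication is left to the
# checker, so the loop ends once it stops returning unseen links.
def crawl(start, checker, fetch)
  queue = [start]
  until queue.empty?
    uri = queue.shift
    queue.concat(checker.check(uri, fetch.call(uri)))
  end
  checker.results
end
```

# With ConsistencyChecker loaded, a run might look like
# crawl(start, ConsistencyChecker.new(SketchLinkfinder.new, start), fetch),
# where start is a URI and fetch is any callable that maps a URI to a
# SketchResponse (or to a real net/HTTP response).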