Add support for crawling subdomains #27

Open
alexspeller wants to merge 1 commit

3 participants

@alexspeller

Merge changes to support subdomain crawling from runa@91559bd

@MaGonglei

This feature is very useful.
I think Anemone should also support printing out external links: just print them, but don't crawl them any deeper.
The link checker tool XENU (http://home.snafu.de/tilman/xenulink.html) has this feature.

@wokkaflokka

MaGonglei: It is very simple to gather external links using Anemone, and comparably simple to check those links and verify that they are valid, etc. The 'on_every_page' block is very helpful in this regard.

If you'd like some code that does exactly what you are asking, I could send an example your way.
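For readers following along, here is a minimal sketch of the kind of example offered above, assuming www.example.com as a placeholder start URL; it collects off-site links from each crawled page and then issues a HEAD request per link, roughly what a checker like XENU does:

require 'anemone'
require 'net/http'

external_links = []

Anemone.crawl("http://www.example.com") do |anemone|
  anemone.on_every_page do |page|
    next unless page.doc  # non-HTML responses have no parsed document
    page.doc.xpath('//a[@href]').each do |a|
      link = page.to_absolute(URI(a['href'])) rescue next
      # keep only HTTP(S) links that point off the current page's host
      external_links << link if link.is_a?(URI::HTTP) && link.host != page.url.host
    end
  end
end

# XENU-style check: HEAD each unique external link and report its status code
external_links.uniq.each do |link|
  begin
    response = Net::HTTP.start(link.host, link.port, :use_ssl => link.scheme == 'https') do |http|
      http.head(link.request_uri)
    end
    puts "#{response.code} #{link}"
  rescue StandardError => e
    puts "FAILED #{link} (#{e.class})"
  end
end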

@MaGonglei

Hi wokkaflokka, thanks for your reply.
I think I know what you mean, but I would prefer to have this feature available when I initialize the Anemone crawl, like:
Anemone.crawl("http://www.example.com", :external_links => false) do |anemone|
....
end

Because if I use the "on_every_page" block to search for external links (e.g. page.doc.xpath('//a[@href]')), it seems to cost too much CPU and memory.

If I'm wrong, please send me the example.

Thanks.
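One thing worth noting about the CPU concern: with the change to Page#links in this pull request (the in_domain? filter is removed from the links method), external links can be taken straight from page.links, which Anemone parses and memoizes once per page anyway, so no second XPath pass is needed. A sketch assuming this branch's behaviour, again with www.example.com as a placeholder start URL:

require 'anemone'

external_links = []

Anemone.crawl("http://www.example.com") do |anemone|
  anemone.on_every_page do |page|
    # page.links is built once per page by Anemone itself; on this branch it
    # also contains off-site URLs, so filtering them out is cheap
    external_links.concat(page.links.reject { |link| page.in_domain?(link) })
  end
end

puts external_links.uniq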

Showing with 36 additions and 12 deletions.
  1. +14 −2 lib/anemone/core.rb
  2. +10 −10 lib/anemone/page.rb
  3. +12 −0 spec/core_spec.rb
lib/anemone/core.rb
@@ -55,7 +55,9 @@ class Core
# proxy server port number
:proxy_port => false,
# HTTP read timeout in seconds
- :read_timeout => nil
+ :read_timeout => nil,
+ # Crawl subdomains?
+ :crawl_subdomains => false
}
# Create setter methods for all options to be called from the crawl block
@@ -72,6 +74,7 @@ class Core
def initialize(urls, opts = {})
@urls = [urls].flatten.map{ |url| url.is_a?(URI) ? url : URI(url) }
@urls.each{ |url| url.path = '/' if url.path.empty? }
+ @valid_domains = @urls.map{|u| [u.host,u.host.gsub(/^www\./,'.')]}.flatten.compact.uniq
@tentacles = []
@on_every_page_blocks = []
@@ -256,7 +259,16 @@ def visit_link?(link, from_page = nil)
!skip_link?(link) &&
!skip_query_string?(link) &&
allowed(link) &&
- !too_deep?(from_page)
+ !too_deep?(from_page) &&
+ (in_allowed_domain?(link) or in_allowed_subdomain?(link))
+ end
+
+ def in_allowed_domain?(link)
+ @valid_domains.index(link.host)
+ end
+
+ def in_allowed_subdomain?(link)
+ opts[:crawl_subdomains] and @valid_domains.find{|domain| link.host.end_with?(domain)}
end
#
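For anyone trying the branch out, a minimal usage sketch of the option added above (www.example.com stands in for a real start URL):

require 'anemone'

# :crawl_subdomains defaults to false. When true, a link passes visit_link?
# if its host ends with one of the start hosts (a leading "www." is stripped
# for the comparison), so e.g. blog.example.com is followed when the crawl
# starts at www.example.com.
Anemone.crawl("http://www.example.com", :crawl_subdomains => true) do |anemone|
  anemone.on_every_page do |page|
    puts "#{page.code} #{page.url}"
  end
end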
lib/anemone/page.rb
@@ -63,7 +63,7 @@ def links
u = a['href']
next if u.nil? or u.empty?
abs = to_absolute(URI(u)) rescue next
- @links << abs if in_domain?(abs)
+ @links << abs
end
@links.uniq!
@links
@@ -130,7 +130,15 @@ def redirect?
def not_found?
404 == @code
end
-
+
+ #
+ # Returns +true+ if *uri* is in the same domain as the page, returns
+ # +false+ otherwise
+ #
+ def in_domain?(uri)
+ uri.host == @url.host
+ end
+
#
# Converts relative URL *link* into an absolute URL based on the
# location of the page
@@ -149,14 +157,6 @@ def to_absolute(link)
return absolute
end
- #
- # Returns +true+ if *uri* is in the same domain as the page, returns
- # +false+ otherwise
- #
- def in_domain?(uri)
- uri.host == @url.host
- end
-
def marshal_dump
[@url, @headers, @data, @body, @links, @code, @visited, @depth, @referer, @redirect_to, @response_time, @fetched]
end
spec/core_spec.rb
@@ -42,6 +42,18 @@ module Anemone
core.pages.keys.should_not include('http://www.other.com/')
end
+ it "should follow links to subdomains" do
+ pages = []
+ pages << FakePage.new('0', :links => ['1'], :hrefs => [ 'http://www.other.com/', 'http://subdomain.example.com/'] )
+ pages << FakePage.new('1')
+
+ core = Anemone.crawl(pages[0].url, @opts.merge({:crawl_subdomains => true}))
+
+ core.should have(3).pages
+ core.pages.keys.should_not include('http://www.other.com/')
+ core.pages.keys.should include('http://subdomain.example.com/')
+ end
+
it "should follow http redirects" do
pages = []
pages << FakePage.new('0', :links => ['1'])