HTTP request header support #48

Open
ashmckenzie wants to merge 1 commit

5 participants

@ashmckenzie

Hi,

I needed to add some additional HTTP request headers and didn't see any support for that currently. Happy to change anything / add more specs to cover the change.

Ash.
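
A minimal usage sketch of the option this commit adds (the URL and header names below are only illustrative; the option name and the add_field behaviour come from the diff further down):

require 'anemone'

# :http_request_headers is the new option introduced in lib/anemone/core.rb;
# each key/value pair is added to every outgoing request via Net::HTTPHeader#add_field.
Anemone.crawl("http://example.com/",
              :http_request_headers => { 'Accept-Language'  => 'en',
                                         'X-Requested-With' => 'Anemone' }) do |anemone|
  anemone.on_every_page { |page| puts page.url }
end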

@jgnagy

I agree that this should be in there... please accept this so everyone can benefit! :-)

@brutuscat

@ashmckenzie, would you please submit a new PR against the new fork, called Medusa?
Instead of Net::HTTP it uses OpenURI, which means the headers should be passed in the options argument, as seen in http.rb. Just make sure the keys are Strings. From the OpenURI option docs:

options must be a hash.
Each option with a string key specifies an extra header field for HTTP. I.e., it is ignored for FTP without HTTP proxy.
The hash may include other options, where keys are symbols
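
For what it's worth, a standalone OpenURI sketch of that behaviour (the URL is illustrative, and :read_timeout is just one example of a symbol-keyed option):

require 'open-uri'

# String keys become extra HTTP request headers; symbol keys stay OpenURI options.
page = URI.open('http://example.com/test.htm',
                'Accept'          => 'text/html',
                'Accept-Language' => 'en',
                :read_timeout     => 10)
puts page.read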

@atgs-ghayakawa

:+1:
Thank you!
This allowed me to crawl a page that requires Basic authentication.

Sample:

require 'anemone'
require 'base64'
require 'kconv'  # provides String#toutf8 used below

url = "http://example.com/test.htm"
# Strip the trailing newline that Base64.encode64 appends
auth_base64 = Base64.encode64('USER:PASSWORD').gsub(/\n/, "")
headers = { "Authorization" => "Basic #{auth_base64}" }

Anemone.crawl(url, :http_request_headers => headers) do |anemone|
  anemone.on_every_page do |page|
    puts page.url
    puts page.body.toutf8
  end
end
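
For the Basic-auth case specifically, lib/anemone/http.rb already calls req.basic_auth url.user, url.password when credentials are embedded in the URL (see the diff below), so the same crawl should also work without a custom header (hypothetical URL and credentials):

require 'anemone'

# Credentials in the URL trigger the existing req.basic_auth code path in http.rb
Anemone.crawl("http://USER:PASSWORD@example.com/test.htm") do |anemone|
  anemone.on_every_page { |page| puts page.url }
end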
Commits on Feb 13, 2012
  1. HTTP request header support

    Ash McKenzie authored
4 lib/anemone/core.rb
@@ -55,7 +55,9 @@ class Core
# proxy server port number
:proxy_port => false,
# HTTP read timeout in seconds
- :read_timeout => nil
+ :read_timeout => nil,
+ # HTTP request headers
+ :http_request_headers => {}
}
# Create setter methods for all options to be called from the crawl block
9 lib/anemone/http.rb
@@ -95,6 +95,13 @@ def read_timeout
@opts[:read_timeout]
end
+ #
+ # HTTP request headers
+ #
+ def http_request_headers
+ @opts[:http_request_headers] || {}
+ end
+
private
#
@@ -136,6 +143,8 @@ def get_response(url, referer = nil)
req = Net::HTTP::Get.new(full_path, opts)
# HTTP Basic authentication
req.basic_auth url.user, url.password if url.user
+ # HTTP request headers
+ http_request_headers.each { |header, value| req.add_field(header, value) }
response = connection(url).request(req)
finish = Time.now()
response_time = ((finish - start) * 1000).round
6 spec/core_spec.rb
@@ -305,7 +305,8 @@ module Anemone
:discard_page_bodies => true,
:user_agent => 'test',
:obey_robots_txt => true,
- :depth_limit => 3)
+ :depth_limit => 3,
+ :http_request_headers => { 'Accept' => 'text/html'})
core.opts[:verbose].should == false
core.opts[:threads].should == 2
@@ -314,6 +315,7 @@ module Anemone
core.opts[:user_agent].should == 'test'
core.opts[:obey_robots_txt].should == true
core.opts[:depth_limit].should == 3
+ core.opts[:http_request_headers].should == { 'Accept' => 'text/html' }
end
it "should accept options via setter methods in the crawl block" do
@@ -324,6 +326,7 @@ module Anemone
a.user_agent = 'test'
a.obey_robots_txt = true
a.depth_limit = 3
+ a.http_request_headers = { 'Accept' => 'text/html'}
end
core.opts[:verbose].should == false
@@ -333,6 +336,7 @@ module Anemone
core.opts[:user_agent].should == 'test'
core.opts[:obey_robots_txt].should == true
core.opts[:depth_limit].should == 3
+ core.opts[:http_request_headers].should == { 'Accept' => 'text/html' }
end
it "should use 1 thread if a delay is requested" do
7 spec/http_spec.rb
@@ -14,6 +14,13 @@ module Anemone
http.fetch_page(SPEC_DOMAIN).should be_an_instance_of(Page)
end
+ it "should set the nominated HTTP request headers" do
+ net_http_get = double('Net::HTTP:Get')
+ Net::HTTP::Get.stub(:new).and_return(net_http_get)
+ net_http_get.should_receive(:add_field).with('Accept', 'text/html')
+ http = Anemone::HTTP.new({ :http_request_headers => { 'Accept' => 'text/html' }})
+ http.fetch_page(SPEC_DOMAIN)
+ end
end
end
end