Bypass Cloudflare DDoS protection when fetching feeds
When a site behind Cloudflare is under too much traffic, Cloudflare may
respond with a specially crafted challenge page instead of the requested
content; this can happen with a feed's fetch URL as well, if it's behind
Cloudflare.

The response has an HTTP 503 status code, which RestClient and other
simple HTTP clients (without JavaScript support) treat as an
unrecoverable error, so they never actually get the feed.

However, the returned page contains JavaScript that, after a delay, runs
certain validations (not sure about the details) and finally redirects
the browser to the page originally requested. So, to fetch pages that
have this DDoS protection active, we must:

- detect that FeedBunch got a 503 error and that the response body
contains Cloudflare code
- in this case, try fetching the feed again with a full-featured
browser (chrome-headless) so that the JavaScript can run
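
The detection step can be sketched in isolation. This is a minimal illustration, not the code in this commit: `cloudflare_challenge?` is a hypothetical helper name, and matching the word "Cloudflare" in the body is a heuristic that could also match an unrelated 503 page that merely mentions Cloudflare.

```ruby
# Hypothetical helper (not part of FeedBunch): decides whether a failed
# response looks like a Cloudflare challenge page rather than a real outage.
# Assumption: the challenge page always mentions "Cloudflare" in its body.
def cloudflare_challenge?(status, body)
  status == 503 && body.to_s.match?(/cloudflare/i)
end
```

Keeping the check as a pure function of status and body makes it easy to test without a network connection.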

The page fetched with chrome-headless is then treated the same as a
successful RestClient response.

We use chrome-headless only if FeedBunch thinks a page is behind
Cloudflare DDoS protection, because it's slower and more
resource-intensive than using a simple HTTP client like RestClient.
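
A minimal sketch of that ordering, with both fetchers injected as callables so it runs without a network or a browser. `ServiceUnavailable`, `BrowserTimeout` and `fetch_with_fallback` are hypothetical names used only for illustration; in the commit the corresponding errors are RestClient's and Selenium's own.

```ruby
# Hypothetical stand-in for the 503 error raised by the simple HTTP client.
class ServiceUnavailable < StandardError
  attr_reader :body

  def initialize(body)
    super('503 Service Unavailable')
    @body = body
  end
end

# Hypothetical stand-in for the headless browser giving up on the page.
class BrowserTimeout < StandardError; end

# Tries the cheap client first; uses the browser only for Cloudflare challenges.
def fetch_with_fallback(url, simple_fetch:, browser_fetch:)
  simple_fetch.call(url)
rescue ServiceUnavailable => e
  # A 503 that is not a Cloudflare challenge is a real error: re-raise it.
  raise unless e.body.to_s.match?(/cloudflare/i)

  begin
    browser_fetch.call(url)
  rescue BrowserTimeout
    # The browser could not get through either: surface the original 503.
    raise e
  end
end
```

Because the browser callable is only invoked inside the rescue branch, feeds that answer normally never pay the cost of starting a browser.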
amatriain committed Apr 5, 2018
1 parent 1a708f6 commit 5a0c535
Showing 2 changed files with 37 additions and 5 deletions.
4 changes: 3 additions & 1 deletion Gemfile
@@ -51,6 +51,9 @@ gem 'ox'
 # HTTP client
 gem 'rest-client'

+# headless browser, to fetch some websites behind cloudflare DDOS protection
+gem 'selenium-webdriver'
+
 # HTTP client-side caching
 gem 'rest-client-components'
 gem 'rack-cache'
@@ -137,7 +140,6 @@ group :test do

   # To simulate a user's browser during acceptance testing
   gem 'capybara'
-  gem 'selenium-webdriver'

   # To be able to open the browser during debugging of acceptance tests
   gem 'launchy'
38 changes: 34 additions & 4 deletions app/clients/feed_client.rb
@@ -99,10 +99,40 @@ def self.fetch_valid_feed(feed, http_caching, perform_autodiscovery)
       RestClient.disable Rack::Cache
     end

-    # GET the feed
-    Rails.logger.info "Fetching from URL #{url}"
-
-    feed_response = RestClient.get url, user_agent: user_agent
+    begin
+      # try to GET the feed with a simple HTTP client (js not enabled)
+      Rails.logger.info "Fetching from URL #{url}"
+      feed_response = RestClient.get url, user_agent: user_agent
+    rescue RestClient::ServiceUnavailable => e
+      # try to overcome Cloudflare DDoS protection with a full-featured headless browser
+      # Cloudflare sends a 503 error but with a js in the page that after a delay redirects to the actual requested page
+      if e.http_code == 503 && e.response.match?(/Cloudflare/i)
+        begin
+          Rails.logger.info "URL #{url} is behind Cloudflare DDoS protection, using a full browser to fetch it"
+          opts = Selenium::WebDriver::Chrome::Options.new
+          opts.add_argument '--headless'
+          browser = Selenium::WebDriver.for :chrome, options: opts
+          browser.get url
+          wait = Selenium::WebDriver::Wait.new timeout: 20
+          wait.until {
+            # wait until the page in the browser is an RSS or Atom feed, or a timeout happens
+            browser.find_element :xpath, '//rss|//feed'
+          }
+          feed_response = browser.page_source
+          # some methods necessary later, to emulate a RestClient response
+          feed_response.define_singleton_method :headers do
+            return []
+          end
+        rescue Selenium::WebDriver::Error::TimeOutError => eTimeout
+          Rails.logger.info "Cannot access URL #{url} behind Cloudflare DDoS protection even with a full browser"
+          # if after all the full browser cannot get the feed, raise the original error returned to RestClient
+          raise e
+        end
+      else
+        # if a HTTP 503 error is received but the page is not a Cloudflare DDoS protection page, raise the error as usual
+        raise e
+      end
+    end

     # If the response was retrieved from the cache, do not process it (entries are already in the db)
     if http_caching
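
The added code gives the page source string a `headers` method so that later code can treat it like a RestClient response. A minimal sketch of that technique outside the app follows; `wrap_page_source` is a hypothetical name, and returning an empty Hash instead of the commit's empty array is an assumption made here so that lookups by symbol return nil.

```ruby
# Hypothetical helper (not part of FeedBunch): makes a plain String answer
# `headers`, the one RestClient response method this sketch assumes callers use.
def wrap_page_source(page_source)
  # dup so a frozen input is not modified and can take a singleton method
  response = page_source.dup
  # Assumption: an empty Hash is enough, since a browser fetch has no
  # HTTP caching headers to report.
  response.define_singleton_method(:headers) { {} }
  response
end
```

The wrapped object still behaves as a String, so code that parses the feed body does not need to know which client fetched it.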