Bypass Cloudflare DDoS protection when fetching feeds

Sometimes, when a page behind Cloudflare is receiving too much traffic, Cloudflare responds with a specially crafted page. This can also happen with the feed fetch URL, if it is behind Cloudflare.

The response has an HTTP 503 status code, which RestClient and other simple (non-JavaScript) HTTP clients interpret as an unrecoverable error, so they never actually get the feed. However, the returned page contains JavaScript that, after some time, performs certain validations (not sure about the details) and finally redirects the browser to the page originally requested.

So, to fetch pages that have this DDoS protection active, we must:

- detect that FeedBunch is getting a 503 error but the page contains Cloudflare code
- in this case, try fetching the feed again using a full-featured browser (chrome-headless) so that the JavaScript code can run

The page fetched with chrome-headless is then treated the same as a successful RestClient response.

We use chrome-headless only if FeedBunch thinks a page is behind Cloudflare DDoS protection, because it is slower and more resource-intensive than a simple HTTP client like RestClient.
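
The two steps above can be sketched independently of RestClient and Selenium. This is a minimal illustration, not the code in this commit: `HttpError`, `BrowserTimeout`, `cloudflare_challenge?` and `fetch_with_fallback` are hypothetical names, and the two fetchers are passed in as callables so the control flow can be exercised without network access.

```ruby
# Hypothetical stand-in for the error a simple HTTP client raises on a 503.
class HttpError < StandardError
  attr_reader :http_code, :body

  def initialize(http_code, body)
    super("HTTP #{http_code}")
    @http_code = http_code
    @body = body
  end
end

# Hypothetical stand-in for a headless-browser wait timeout.
class BrowserTimeout < StandardError; end

# A response is treated as a Cloudflare challenge page only if it is a 503
# and its body mentions Cloudflare (the same heuristic the commit describes).
def cloudflare_challenge?(http_code, body)
  http_code == 503 && body.to_s.match?(/cloudflare/i)
end

# Try the cheap client first; use the slower browser-based fetcher only when
# the failure looks like a Cloudflare challenge page.
def fetch_with_fallback(url, simple_fetch:, browser_fetch:)
  simple_fetch.call(url)
rescue HttpError => e
  # Any other 503 is re-raised unchanged.
  raise unless cloudflare_challenge?(e.http_code, e.body)

  begin
    browser_fetch.call(url)
  rescue BrowserTimeout
    # If the browser cannot get the feed either, surface the original error.
    raise e
  end
end
```

In real use, `simple_fetch` would wrap the plain HTTP GET and `browser_fetch` would wrap the headless browser session; keeping them as parameters also makes the fallback easy to test with stubs.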
amatriain · Apr 5, 2018 · 5a0c535
1 parent 1a708f6

Showing 2 changed files with 37 additions and 5 deletions.
diff --git a/Gemfile b/Gemfile
@@ -51,6 +51,9 @@ gem 'ox'
 # HTTP client
 gem 'rest-client'
 
+# headless browser, to fetch some websites behind Cloudflare DDoS protection
+gem 'selenium-webdriver'
+
 # HTTP client-side caching
 gem 'rest-client-components'
 gem 'rack-cache'
@@ -137,7 +140,6 @@ group :test do
 
   # To simulate a user's browser during acceptance testing
   gem 'capybara'
-  gem 'selenium-webdriver'
 
   # To be able to open the browser during debugging of acceptance tests
   gem 'launchy'

diff --git a/app/clients/feed_client.rb b/app/clients/feed_client.rb
@@ -99,10 +99,40 @@ def self.fetch_valid_feed(feed, http_caching, perform_autodiscovery)
       RestClient.disable Rack::Cache
     end
 
-    # GET the feed
-    Rails.logger.info "Fetching from URL #{url}"
-
-    feed_response = RestClient.get url, user_agent: user_agent
+    begin
+      # try to GET the feed with a simple HTTP client (js not enabled)
+      Rails.logger.info "Fetching from URL #{url}"
+      feed_response = RestClient.get url, user_agent: user_agent
+    rescue RestClient::ServiceUnavailable => e
+      # try to overcome Cloudflare DDoS protection with a full-featured headless browser
+      # Cloudflare sends a 503 error but with a js in the page that after a delay redirects to the actual requested page
+      if e.http_code == 503 && e.response.match?(/Cloudflare/i)
+        begin
+          Rails.logger.info "URL #{url} is behind Cloudflare DDoS protection, using a full browser to fetch it"
+          opts = Selenium::WebDriver::Chrome::Options.new
+          opts.add_argument '--headless'
+          browser = Selenium::WebDriver.for :chrome, options: opts
+          browser.get url
+          wait = Selenium::WebDriver::Wait.new timeout: 20
+          wait.until {
+            # wait until the page in the browser is an RSS or Atom feed, or a timeout happens
+            browser.find_element :xpath, '//rss|//feed'
+          }
+          feed_response = browser.page_source
+          # some methods necessary later, to emulate a RestClient response
+          feed_response.define_singleton_method :headers do
+            return []
+          end
+        rescue Selenium::WebDriver::Error::TimeOutError => eTimeout
+          Rails.logger.info "Cannot access URL #{url} behind Cloudflare DDoS protection even with a full browser"
+          # if after all the full browser cannot get the feed, raise the original error returned to RestClient
+          raise e
+        end
+      else
+        # if a HTTP 503 error is received but the page is not a Cloudflare DDoS protection page, raise the error as usual
+        raise e
+      end
+    end
 
     # If the response was retrieved from the cache, do not process it (entries are already in the db)
     if http_caching