
Commit

Merge 6eab6f1 into cdec9f8
oltarasenko committed Feb 19, 2020
2 parents cdec9f8 + 6eab6f1 commit de4003b
Showing 3 changed files with 107 additions and 0 deletions.
9 changes: 9 additions & 0 deletions README.md
@@ -86,6 +86,15 @@ historical archival.
- `iex(1)> Crawly.Engine.start_spider(EslSpider)`
6. Results can be seen with: `$ cat /tmp/EslSpider.csv`

## Browser rendering

Crawly can be configured so that all fetched pages are rendered by a browser,
which can be very useful if you need to extract data from pages with lots of
asynchronous elements (for example, parts loaded via AJAX).

You can read more here:
- [Browser Rendering](https://hexdocs.pm/crawly/basic_concepts.html#browser-rendering)
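
For example, assuming a Splash instance is running locally on port 8050 and that
the fetcher is set in the standard `config :crawly` block, the configuration could
look roughly like this sketch:

```elixir
# config/config.exs -- sketch; adjust the Splash URL to your setup
config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```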

## Documentation

- [API Reference](https://hexdocs.pm/crawly/api-reference.html#content)
32 changes: 32 additions & 0 deletions documentation/basic_concepts.md
@@ -292,3 +292,35 @@ defmodule MyApp.BlogPostPipeline do
def run(item, state, _opts), do: {item, state}
end
```

## Browser rendering

Browser rendering is one of the most complex problems in scraping. The Internet
is moving towards more dynamic content, where not only are parts of a page loaded
asynchronously, but entire applications may be rendered by JavaScript and AJAX.

In most cases it is still possible to extract data from dynamically rendered
pages (e.g. by sending the additional POST requests such pages make after
loading), but this approach has visible drawbacks: from our point of view it
makes the spider code quite complicated and fragile.

Of course, it's better when you can simply get pages that are already rendered
for you, and we solve this problem with the help of pluggable HTTP fetchers.

Crawly's codebase contains a special Splash fetcher, which performs browser
rendering before the page content is parsed by a spider. It's also possible to
build your own fetchers, as the sketch below illustrates.
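
For illustration, a minimal custom fetcher might look like the following sketch.
The module name is hypothetical; the callback shape mirrors the Splash fetcher
added in this commit: a `fetch/2` function that takes a `Crawly.Request.t()` plus
the configured fetcher options and returns an HTTPoison-style `{:ok, response}` /
error tuple.

```elixir
defmodule MyApp.PlainFetcher do
  @moduledoc """
  Hypothetical pass-through fetcher: issues the request with HTTPoison
  without any browser rendering.
  """
  @behaviour Crawly.Fetchers.Fetcher

  # `request` carries the url, headers and HTTPoison options prepared by Crawly;
  # `_client_options` are whatever is configured in the fetcher tuple.
  def fetch(request, _client_options) do
    HTTPoison.get(request.url, request.headers, request.options)
  end
end
```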

### Using the Splash fetcher for browser rendering

Splash is a lightweight, open-source browser implementation built with QT and Python.
See: https://splash.readthedocs.io/en/stable/api.html

You can try using Splash with Crawly in the following way:

1. Start Splash locally (e.g. using a Docker image):
   `docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300`
2. Configure Crawly to use Splash:
   `fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}`
3. Now all your pages will automatically be rendered by Splash (see the
   configuration sketch below).
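
Putting it together, the configuration could live in `config/config.exs` roughly
as sketched below. Because the fetcher forwards any extra options to Splash as
query parameters of the render.html call, Splash options such as `wait` can
(hypothetically) be passed the same way:

```elixir
# config/config.exs -- a sketch, assuming Splash runs locally on port 8050
config :crawly,
  fetcher:
    {Crawly.Fetchers.Splash,
     [
       base_url: "http://localhost:8050/render.html",
       # Extra options are forwarded to Splash as query parameters;
       # `wait` asks Splash to wait (seconds) after the page loads.
       wait: 3
     ]}
```
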
66 changes: 66 additions & 0 deletions lib/crawly/fetchers/splash.ex
@@ -0,0 +1,66 @@
defmodule Crawly.Fetchers.Splash do
  @moduledoc """
  Implements the Crawly.Fetchers.Fetcher behaviour for Splash JavaScript
  rendering. Splash is a lightweight, QT-based JavaScript rendering engine. See:
  https://splash.readthedocs.io/

  Splash exposes the render.html endpoint, which renders the page for the URL
  passed in the ?url GET parameter.

  This particular Splash fetcher converts all requests made by Crawly into
  Splash requests and cleans up the final responses by removing the Splash
  parts from them.

  You can start the Splash server in any documented way. One of the options is
  to run it locally with the help of Docker:

      docker run -it -p 8050:8050 scrapinghub/splash

  In this case you have to configure the fetcher in the following way:

      fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]},
  """
  @behaviour Crawly.Fetchers.Fetcher

  require Logger

  @spec fetch(request, client_options) :: response
        when request: Crawly.Request.t(),
             client_options: keyword(),
             response: Crawly.Response.t()
  def fetch(request, client_options) do
    # Keyword.pop/3 always returns a {value, rest} tuple, so a missing
    # :base_url shows up as {nil, rest}.
    {base_url, other_options} =
      case Keyword.pop(client_options, :base_url, nil) do
        {nil, _other_options} ->
          throw(
            "The base_url is not set. Splash fetcher can't be used! " <>
              "Please set :base_url in fetcher options to continue. " <>
              "For example: " <>
              "fetcher: {Crawly.Fetchers.Splash, [base_url: <url>]}"
          )

        {base_url, other_options} ->
          {base_url, other_options}
      end

    # The target URL and any remaining fetcher options are passed to Splash
    # as query parameters of the render.html call.
    query_parameters = URI.encode_query(Keyword.put(other_options, :url, request.url))

    url =
      URI.merge(base_url, "?" <> query_parameters)
      |> URI.to_string()

    case HTTPoison.get(url, request.headers, request.options) do
      {:ok, response} ->
        # Rewrite the request URL so the response looks like it came from the
        # original page rather than from the Splash endpoint.
        new_request = %HTTPoison.Request{response.request | url: request.url}

        new_response = %HTTPoison.Response{
          response
          | request: new_request,
            request_url: request.url
        }

        {:ok, new_response}

      error ->
        error
    end
  end
end
