From 6eab6f1700dd26291fd861062eab8fcd73a96f07 Mon Sep 17 00:00:00 2001
From: Oleg Tarasenko
Date: Mon, 17 Feb 2020 13:14:31 +0100
Subject: [PATCH] Add browser rendering support

Added a fetcher which uses the Splash browser renderer to fetch pages.

---
 README.md                       |  9 +++++
 documentation/basic_concepts.md | 32 ++++++++++++++++
 lib/crawly/fetchers/splash.ex   | 70 +++++++++++++++++++++++++++++++++++
 3 files changed, 111 insertions(+)
 create mode 100644 lib/crawly/fetchers/splash.ex

diff --git a/README.md b/README.md
index 1f5ff3a9..08df168c 100644
--- a/README.md
+++ b/README.md
@@ -86,6 +86,15 @@ historical archival.
 - `iex(1)> Crawly.Engine.start_spider(EslSpider)`
 6. Results can be seen with: `$ cat /tmp/EslSpider.csv`
 
+## Browser rendering
+
+Crawly can be configured so that all fetched pages are browser rendered,
+which can be very useful if you need to extract data from pages which have
+lots of asynchronous elements (for example, parts loaded by AJAX).
+
+You can read more here:
+- [Browser Rendering](https://hexdocs.pm/crawly/basic_concepts.html#browser-rendering)
+
 ## Documentation
 
 - [API Reference](https://hexdocs.pm/crawly/api-reference.html#content)
diff --git a/documentation/basic_concepts.md b/documentation/basic_concepts.md
index b4cfc8ca..4e00847a 100644
--- a/documentation/basic_concepts.md
+++ b/documentation/basic_concepts.md
@@ -292,3 +292,35 @@ defmodule MyApp.BlogPostPipeline do
   def run(item, state, _opts), do: {item, state}
 end
 ```
+
+## Browser rendering
+
+Browser rendering is one of the most complex problems in scraping. The Internet
+is moving towards more dynamic content, where not only parts of a page are
+loaded asynchronously, but entire applications may be rendered by JavaScript
+and AJAX.
+
+In most cases it's still possible to extract data from dynamically rendered
+pages (e.g. by sending additional POST requests from the loaded pages);
+however, this approach has visible drawbacks: from our point of view it makes
+the spider code quite complicated and fragile.
+
+Of course, it's better when you can get pages already rendered for you, and
+we solve this problem with the help of pluggable HTTP fetchers.
+
+Crawly's codebase contains a special Splash fetcher, which performs browser
+rendering before the page content is parsed by a spider. It's also possible
+to build your own fetchers.
+
+### Using the Splash fetcher for browser rendering
+
+Splash is a lightweight, open-source browser implementation built with Qt and
+Python. See: https://splash.readthedocs.io/en/stable/api.html
+
+You can try using Splash with Crawly in the following way:
+
+1. Start Splash locally (e.g. using a Docker image):
+   `docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300`
+2. Configure Crawly to use Splash:
+   `fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}`
+3. Now all your pages will be automatically rendered by Splash.
\ No newline at end of file
diff --git a/lib/crawly/fetchers/splash.ex b/lib/crawly/fetchers/splash.ex
new file mode 100644
index 00000000..fbf5c7be
--- /dev/null
+++ b/lib/crawly/fetchers/splash.ex
@@ -0,0 +1,70 @@
+defmodule Crawly.Fetchers.Splash do
+  @moduledoc """
+  Implements the Crawly.Fetchers.Fetcher behavior for Splash JavaScript rendering.
+
+  Splash is a lightweight, Qt-based JavaScript rendering engine. See:
+  https://splash.readthedocs.io/
+
+  Splash exposes the render.html endpoint, which renders the pages for
+  incoming requests sent with the ?url GET parameter.
+
+  This particular Splash fetcher converts all requests made by Crawly into
+  Splash requests, and cleans up the final responses by removing the
+  Splash-specific parts from the response.
+
+  It's possible to start the Splash server in any documented way. One of the
+  options is to run it locally with the help of Docker:
+  docker run -it -p 8050:8050 scrapinghub/splash
+
+  In this case you have to configure the fetcher in the following way:
+  `fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]},`
+  """
+  @behaviour Crawly.Fetchers.Fetcher
+
+  require Logger
+
+  @spec fetch(request, client_options) :: response
+        when request: Crawly.Request.t(),
+             client_options: keyword(),
+             response: Crawly.Response.t()
+  def fetch(request, client_options) do
+    {base_url, other_options} =
+      case Keyword.pop(client_options, :base_url) do
+        {nil, _} ->
+          throw(
+            "The base_url is not set. Splash fetcher can't be used! " <>
+              "Please set :base_url in fetcher options to continue. " <>
+              "For example: " <>
+              "fetcher: {Crawly.Fetchers.Splash, [base_url: \"http://localhost:8050/render.html\"]}"
+          )
+
+        {base_url, other_options} ->
+          {base_url, other_options}
+      end
+
+    # The original request URL is passed to Splash via the ?url GET parameter;
+    # all other fetcher options are forwarded to Splash as query parameters.
+    query_parameters = URI.encode_query(Keyword.put(other_options, :url, request.url))
+
+    url =
+      URI.merge(base_url, "?" <> query_parameters)
+      |> URI.to_string()
+
+    case HTTPoison.get(url, request.headers, request.options) do
+      {:ok, response} ->
+        # Restore the original URL on the response, so spiders see the page
+        # they requested rather than the Splash endpoint URL.
+        new_request = %HTTPoison.Request{response.request | url: request.url}
+
+        new_response = %HTTPoison.Response{
+          response
+          | request: new_request,
+            request_url: request.url
+        }
+        {:ok, new_response}
+
+      error ->
+        error
+    end
+  end
+end
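A quick usage sketch (not part of the patch itself): with Splash started as in step 1 above, the fetcher is enabled through the application config. This is a minimal sketch; the `wait: 3` option is a standard Splash render.html argument (seconds to let JavaScript settle) and is included only to illustrate that any extra fetcher options are forwarded to Splash as query parameters:

```elixir
# config/config.exs
import Config

# Route all Crawly requests through the local Splash instance.
# `wait: 3` is a Splash render.html argument, shown for illustration only.
config :crawly,
  fetcher:
    {Crawly.Fetchers.Splash,
     [base_url: "http://localhost:8050/render.html", wait: 3]}
```

Internally the fetcher query-encodes these options together with the requested page URL and merges them onto base_url, so the example above would produce requests like:

```elixir
iex> query = URI.encode_query(Keyword.put([wait: 3], :url, "https://example.com/page"))
iex> URI.merge("http://localhost:8050/render.html", "?" <> query) |> URI.to_string()
"http://localhost:8050/render.html?url=https%3A%2F%2Fexample.com%2Fpage&wait=3"
```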