
Commit

Merge 6eab6f1 into cdec9f8
oltarasenko committed Feb 19, 2020
2 parents cdec9f8 + 6eab6f1 commit de4003b
Showing 3 changed files with 107 additions and 0 deletions.
9 changes: 9 additions & 0 deletions README.md
@@ -86,6 +86,15 @@ historical archival.
- `iex(1)> Crawly.Engine.start_spider(EslSpider)`
6. Results can be seen with: `$ cat /tmp/EslSpider.csv`

## Browser rendering

Crawly can be configured so that all fetched pages are rendered by a browser,
which can be very useful if you need to extract data from pages with lots of
asynchronous elements (for example, parts loaded via AJAX).

You can read more here:
- [Browser Rendering](https://hexdocs.pm/crawly/basic_concepts.html#browser-rendering)
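
For example, assuming a Splash instance is running locally on port 8050 and that
the fetcher is set in the standard `config :crawly` block, the configuration could
look roughly like this sketch:

```elixir
# config/config.exs -- sketch; adjust the Splash URL to your setup
config :crawly,
  fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}
```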

## Documentation

- [API Reference](https://hexdocs.pm/crawly/api-reference.html#content)
32 changes: 32 additions & 0 deletions documentation/basic_concepts.md
@@ -292,3 +292,35 @@ defmodule MyApp.BlogPostPipeline do
def run(item, state, _opts), do: {item, state}
end
```

## Browser rendering

Browser rendering is one of the most complex problems in scraping. The Internet
is moving towards more dynamic content, where not only are parts of a page loaded
asynchronously, but entire applications may be rendered by JavaScript and AJAX.

In most cases it is still possible to extract data from dynamically rendered
pages (e.g. by sending the additional POST requests such pages make after
loading), but this approach has visible drawbacks: from our point of view it
makes the spider code quite complicated and fragile.

Of course, it's better when you can simply get pages that are already rendered
for you, and we solve this problem with the help of pluggable HTTP fetchers.

Crawly's codebase contains a special Splash fetcher, which performs browser
rendering before the page content is parsed by a spider. It's also possible to
build your own fetchers, as the sketch below illustrates.
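
For illustration, a minimal custom fetcher might look like the following sketch.
The module name is hypothetical; the callback shape mirrors the Splash fetcher
added in this commit: a `fetch/2` function that takes a `Crawly.Request.t()` plus
the configured fetcher options and returns an HTTPoison-style `{:ok, response}` /
error tuple.

```elixir
defmodule MyApp.PlainFetcher do
  @moduledoc """
  Hypothetical pass-through fetcher: issues the request with HTTPoison
  without any browser rendering.
  """
  @behaviour Crawly.Fetchers.Fetcher

  # `request` carries the url, headers and HTTPoison options prepared by Crawly;
  # `_client_options` are whatever is configured in the fetcher tuple.
  def fetch(request, _client_options) do
    HTTPoison.get(request.url, request.headers, request.options)
  end
end
```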

### Using the Splash fetcher for browser rendering

Splash is a lightweight, open-source browser implementation built with QT and Python.
See: https://splash.readthedocs.io/en/stable/api.html

You can try using Splash with Crawly in the following way:

1. Start Splash locally (e.g. using a Docker image):
   `docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300`
2. Configure Crawly to use Splash:
   `fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}`
3. Now all your pages will automatically be rendered by Splash (see the
   configuration sketch below).
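
Putting it together, the configuration could live in `config/config.exs` roughly
as sketched below. Because the fetcher forwards any extra options to Splash as
query parameters of the render.html call, Splash options such as `wait` can
(hypothetically) be passed the same way:

```elixir
# config/config.exs -- a sketch, assuming Splash runs locally on port 8050
config :crawly,
  fetcher:
    {Crawly.Fetchers.Splash,
     [
       base_url: "http://localhost:8050/render.html",
       # Extra options are forwarded to Splash as query parameters;
       # `wait` asks Splash to wait (seconds) after the page loads.
       wait: 3
     ]}
```
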
66 changes: 66 additions & 0 deletions lib/crawly/fetchers/splash.ex
@@ -0,0 +1,66 @@
defmodule Crawly.Fetchers.Splash do
  @moduledoc """
  Implements the Crawly.Fetchers.Fetcher behaviour for Splash JavaScript
  rendering. Splash is a lightweight, QT-based JavaScript rendering engine. See:
  https://splash.readthedocs.io/

  Splash exposes the render.html endpoint, which renders the page for the URL
  passed in the ?url GET parameter.

  This particular Splash fetcher converts all requests made by Crawly into
  Splash requests and cleans up the final responses by removing the Splash
  parts from them.

  You can start the Splash server in any documented way. One of the options is
  to run it locally with the help of Docker:

      docker run -it -p 8050:8050 scrapinghub/splash

  In this case you have to configure the fetcher in the following way:

      fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]},
  """
  @behaviour Crawly.Fetchers.Fetcher

  require Logger

  @spec fetch(request, client_options) :: response
        when request: Crawly.Request.t(),
             client_options: keyword(),
             response: Crawly.Response.t()
  def fetch(request, client_options) do
    # Keyword.pop/3 always returns a {value, rest} tuple, so a missing
    # :base_url shows up as {nil, rest}.
    {base_url, other_options} =
      case Keyword.pop(client_options, :base_url, nil) do
        {nil, _other_options} ->
          throw(
            "The base_url is not set. Splash fetcher can't be used! " <>
              "Please set :base_url in fetcher options to continue. " <>
              "For example: " <>
              "fetcher: {Crawly.Fetchers.Splash, [base_url: <url>]}"
          )

        {base_url, other_options} ->
          {base_url, other_options}
      end

    # The target URL and any remaining fetcher options are passed to Splash
    # as query parameters of the render.html call.
    query_parameters = URI.encode_query(Keyword.put(other_options, :url, request.url))

    url =
      URI.merge(base_url, "?" <> query_parameters)
      |> URI.to_string()

    case HTTPoison.get(url, request.headers, request.options) do
      {:ok, response} ->
        # Rewrite the request URL so the response looks like it came from the
        # original page rather than from the Splash endpoint.
        new_request = %HTTPoison.Request{response.request | url: request.url}

        new_response = %HTTPoison.Response{
          response
          | request: new_request,
            request_url: request.url
        }

        {:ok, new_response}

      error ->
        error
    end
  end
end
