Parse pipelines (#150)
* added Parse struct

* implemented parsers

* refactored code for code quality

* Removed restrictive Parse struct

* added docs
Ziinc committed Jan 26, 2021
1 parent 648c406 commit 1ec03d7
Showing 7 changed files with 270 additions and 85 deletions.
85 changes: 52 additions & 33 deletions documentation/basic_concepts.md
@@ -1,7 +1,9 @@
# Basic Concepts

---

## Flow from Request to Response to Parsed Item

Data is fetched in a linear series of operations.

1. New `Request`s are formed through `Crawly.Spider.init/0`.
@@ -10,7 +12,6 @@ Data is fetched in a linear series of operations.
4. The `Spider` receives and parses the response, returning new `Request`s and new parsed items.
5. Parsed items are post-processed individually, while new `Request`s from the `Spider` go back to step 2.


## Spiders

Spiders are modules which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site.
@@ -34,14 +35,14 @@ All items are processed sequentially and are processed by Item pipelines.
In order to make a working web crawler, all the behaviour callbacks need to be implemented.
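
A minimal spider sketch tying these callbacks together might look as follows (the module name, site, and selectors are illustrative, Floki is assumed to be available for HTML parsing, and `Crawly.Utils.request_from_url/1` is assumed to be available for building requests):

```elixir
defmodule MyApp.BooksSpider do
  use Crawly.Spider

  def base_url(), do: "https://books.toscrape.com"

  def init(), do: [start_urls: ["https://books.toscrape.com/"]]

  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    # Scraped items: one map per book title on the page.
    items =
      document
      |> Floki.find("article.product_pod h3 a")
      |> Floki.attribute("title")
      |> Enum.map(fn title -> %{title: title} end)

    # Follow-up requests: the "next page" link, joined naively for this sketch.
    requests =
      document
      |> Floki.find("li.next a")
      |> Floki.attribute("href")
      |> Enum.map(fn href ->
        Crawly.Utils.request_from_url("https://books.toscrape.com/catalogue/" <> href)
      end)

    %Crawly.ParsedItem{items: items, requests: requests}
  end
end
```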

`init()` - a part of the Crawly.Spider behaviour. This function should return a KVList which contains a `start_urls` entry with a list, which defines the starting requests made by Crawly. Alternatively you may provide `start_requests` if it's required
to prepare the first requests on `init()`, which might be useful if, for example, you
want to pass a session cookie to the starting request. Note: `start_requests` are
processed before `start_urls`.
\*\* This callback is going to be deprecated in favour of `init/1`. For now, backwards
compatibility is kept with the help of a macro which always generates `init/1`.

`init(options)` - same as `init/0` but also takes options (which can be passed from the engine during
the spider start).

`base_url()` - defines a base_url of the given Spider. This function is used in order to filter out all requests which are going outside of the crawled website.

@@ -129,9 +130,15 @@ Built-in middlewares:
Crawly.Middlewares.AutoCookiesManager
```

### Response Parsers

> **Response Parsers:** a pipeline module that parses a fetcher's response. If declared, a spider's `c:Crawly.Spider.parse_item/1` callback is ignored. Parsers are unused by default. A parser implements the `Crawly.Pipeline` behaviour.

Parsers allow for logic reuse when spiders parse a fetcher's response.
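
As a hedged sketch — assuming that a parser's `run/3` receives the `%Crawly.ParsedItem{}` accumulator together with a state map carrying the fetcher's response — a custom parser might look like this (the module name and selector are illustrative, and `Crawly.Utils.request_from_url/1` is assumed to be available):

```elixir
defmodule MyApp.Parsers.ExtractProductRequests do
  @behaviour Crawly.Pipeline

  # Assumption: the fetcher's response is available on the state map as :response.
  @impl Crawly.Pipeline
  def run(parsed_item, %{response: response} = state, opts \\ []) do
    selector = Keyword.get(opts, :selector, "a.product")

    {:ok, document} = Floki.parse_document(response.body)

    # Collect matching links and append them as new requests on the accumulator.
    requests =
      document
      |> Floki.find(selector)
      |> Floki.attribute("href")
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    {%{parsed_item | requests: parsed_item.requests ++ requests}, state}
  end
end
```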

### Item Pipelines

> **Item Pipelines:** a pipeline module that modifies and pre-processes a scraped item. It implements the `Crawly.Pipeline` behaviour.

Built-in item pipelines:

Expand Down Expand Up @@ -173,14 +180,14 @@ defmodule MyCustomPipeline do
end
```


### Best Practices

The use of global configs is discouraged, hence options should be passed through a tuple-based pipeline declaration where possible.

When storing information in the `state` map, ensure that the state is namespaced with the pipeline name, so as to avoid key clashing. For example, to store state from `MyEctoPipeline`, store the state on the key `:my_ecto_pipeline_my_state`.
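
A short sketch of both practices together (the module, option name, state key, and `MyApp.Repo` are illustrative):

```elixir
defmodule MyApp.MyEctoPipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts \\ []) do
    # Options come from the tuple-based declaration, e.g.
    # {MyApp.MyEctoPipeline, table: "scraped_items"}, rather than a global config.
    table = Keyword.get(opts, :table, "scraped_items")

    # State is namespaced with the pipeline name to avoid key clashes.
    inserted = Map.get(state, :my_ecto_pipeline_inserted_count, 0)

    MyApp.Repo.insert_all(table, [item])

    {item, Map.put(state, :my_ecto_pipeline_inserted_count, inserted + 1)}
  end
end
```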

### Custom Request Middlewares

#### Request Middleware Example - Add a Proxy

Following the [documentation](https://hexdocs.pm/httpoison/HTTPoison.Request.html) for proxy options of a request in `HTTPoison`, we can do the following:
@@ -206,14 +213,16 @@ defmodule MyApp.MyProxyMiddleware do
end
```
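
A hedged sketch of such a middleware — assuming the `options` field on `Crawly.Request` is passed through to HTTPoison, and using a placeholder proxy address — could be:

```elixir
defmodule MyApp.MyProxyMiddleware do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(request, state, opts \\ []) do
    # HTTPoison accepts a :proxy entry in its request options.
    proxy = Keyword.get(opts, :proxy, "my.proxy.example:3128")

    new_options = Keyword.put(request.options, :proxy, proxy)
    {%{request | options: new_options}, state}
  end
end
```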


### Custom Item Pipelines

Item pipelines receive the parsed item (from the Spider) and perform post-processing on the item.

#### Storing Parsed Items

You can use custom item pipelines to save the item to custom storages.

##### Example - Ecto Storage Pipeline

In this example, we insert the scraped item into a table with Ecto. This example does not directly call `MyRepo.insert`, but delegates it to an application context function.

```elixir
@@ -233,50 +242,58 @@ end
```
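
A minimal sketch of the described approach (`MyApp.create_scraped_item/1` is a hypothetical application context function):

```elixir
defmodule MyApp.EctoStoragePipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Delegate persistence to a context function instead of calling
    # MyRepo.insert/1 directly from the pipeline.
    case MyApp.create_scraped_item(item) do
      {:ok, _record} ->
        {item, state}

      {:error, _changeset} ->
        # Returning false for the item drops it from further processing.
        {false, state}
    end
  end
end
```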

#### Multiple Different Types of Parsed Items

If you need to selectively post-process different types of scraped items, you can utilize pattern-matching at the item pipeline level.

There are two general methods of doing so:

1. Struct-based pattern matching

   ```elixir
   defmodule MyApp.MyCustomPipeline do
     @impl Crawly.Pipeline
     def run(%MyItem{} = item, state, _opts \\ []) do
       # do something
     end

     # do nothing if it does not match
     def run(item, state, _opts), do: {item, state}
   end
   ```

2. Key-based pattern matching

   ```elixir
   defmodule MyApp.MyCustomPipeline do
     @impl Crawly.Pipeline
     def run(%{my_item: my_item} = item, state, _opts \\ []) do
       # do something
     end

     # do nothing if it does not match
     def run(item, state, _opts), do: {item, state}
   end
   ```

Use struct-based pattern matching when:

1. you want to utilize existing Ecto schemas
2. you have pre-defined structs that you want to conform to

Use key-based pattern matching when:

1. you want to process two or more related and inter-dependent items together
2. you want to bulk process multiple items for efficiency reasons. For example, processing the weather data for 365 days in one pass.

##### Caveats

When using the key-based pattern matching method, the spider's `Crawly.Spider.parse_item/1` callback will need to return items with a single key (or a map with multiple keys, if doing related processing).

When using struct-based pattern matching with existing Ecto structs, you will need to do an intermediate conversion of the struct into a map before performing the insertion into the Ecto Repo. This is due to the underlying Ecto schema metadata still being attached to the struct before insertion.
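
For instance, a small helper along these lines (field names are illustrative) can strip the schema metadata before insertion:

```elixir
defmodule MyApp.ItemSanitizer do
  @doc """
  Converts an Ecto schema struct into a plain map suitable for re-insertion,
  dropping the metadata and timestamps Ecto attaches to loaded structs.
  """
  def to_plain_map(struct) do
    struct
    |> Map.from_struct()
    |> Map.drop([:__meta__, :id, :inserted_at, :updated_at])
  end
end
```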

##### Example - Multi-Item Pipelines With Pattern Matching

In this example, your spider scrapes a "blog post" and "weather data" from a website.
We will use the key-based pattern matching approach to selectively post-process a blog post parsed item.


```elixir
# in MyApp.CustomSpider.ex
def parse_item(response) do
@@ -286,8 +303,10 @@ def parse_item(response):
%{weather: [ january_weather, february_weather ]}
]}
```

Then, in the custom pipeline, we will pattern match on the `:blog_post` key, to ensure that we only process blog posts with this pipeline (and not weather data).
We then update the `:blog_post` key of the received item.

```elixir
defmodule MyApp.BlogPostPipeline do
@impl Crawly.Pipeline
@@ -328,7 +347,7 @@ See: https://splash.readthedocs.io/en/stable/api.html
You can try using Splash with Crawly in the following way:

1. Start splash locally (e.g. using a docker image):
   `docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300`
2. Configure Crawly to use Splash:
   `fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}`
3. Now all your pages will be automatically rendered by Splash.
106 changes: 66 additions & 40 deletions documentation/configuration.md
@@ -16,15 +16,53 @@ config :crawly,

## Options

### middlewares :: [module()]

Defines a list of middlewares responsible for pre-processing requests. Any request from the `Crawly.Spider` that does not pass all of the middlewares declared on the `middlewares` key is dropped.

Refer to `Crawly.Pipeline` for more information on the structure of a middleware.

```elixir
# Example middlewares
config :crawly,
middlewares: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.RobotsTxt,
# With options
{Crawly.Middlewares.UserAgent, user_agents: ["My Bot"] },
{Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 15000]}
]
```

### `parsers` :: [module()]

default: nil

By default, parsers are unused, and a `Crawly.Spider`'s `parse_item/1` callback will be called to parse a fetcher's response. However, it is possible to utilize Crawly's built-in parsers or your own custom logic to parse responses from a fetcher.

> **IMPORTANT**: If set at the global level, **ALL** spiders will not have their `parse_item/1` callback used. It is advised to declare parsers at the non-global level using spider-level settings overrides.

Each parser may have additional peer dependencies if used. Refer to the documentation for each parser to know its specific requirements.

Refer to `Crawly.Pipeline` for more information on the structure of a parser.

```elixir
config :crawly,
parsers: [
{Crawly.Parsers.ExtractRequests, selector: "button"},
]
```
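
For example, a spider-level declaration (a sketch; the spider module, site, and selector are illustrative) keeps the `parse_item/1` callbacks of all other spiders intact:

```elixir
defmodule MyApp.ProductsSpider do
  use Crawly.Spider

  def base_url(), do: "https://example.com"

  def init(), do: [start_urls: ["https://example.com/products"]]

  # Still required by the behaviour, but ignored for this spider while
  # parsers are declared.
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}

  def override_settings() do
    [parsers: [{Crawly.Parsers.ExtractRequests, selector: "a.product"}]]
  end
end
```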

### `pipelines` :: [module()]

default: []

Defines a list of pipelines responsible for pre-processing all the scraped items. All items not passing any of the pipelines are dropped. If unset, all items are stored without any modifications.

Refer to `Crawly.Pipeline` for more information on the structure of a pipeline.

Example configuration of item pipelines:

```elixir
config :crawly,
pipelines: [
{Crawly.Pipelines.Validate, fields: [:id, :date]},
@@ -34,22 +72,6 @@ config :crawly,
]
```


### closespider_itemcount :: pos_integer() | :disabled

default: :disabled
@@ -62,7 +84,6 @@ default: nil (disabled by default)

Defines the minimal number of items which need to be scraped by the spider within the given timeframe (1 minute). If the limit is not reached by the spider, it will be stopped.
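
For example (values are illustrative):

```elixir
config :crawly,
  closespider_itemcount: 5000,
  # stop the spider if fewer than 10 items were scraped during the last minute
  closespider_timeout: 10
```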


### concurrent_requests_per_domain :: pos_integer()

default: 4
@@ -78,26 +99,28 @@ NOTE: A worker's speed is often limited by the speed of the actual HTTP client a
### retry :: Keyword list

Allows to configure the retry logic. Accepts the following configuration options:
1. _retry_codes_: Allows to specify a list of HTTP codes which are treated as
   failed responses. (Default: [])
2. _max_retries_: Allows to specify the number of attempts before the request is
   abandoned. (Default: 0)
3. _ignored_middlewares_: Allows to modify the list of processors for a given
   request when a retry happens. (Will be required to avoid clashes with
   Unique.Request middleware).

Example:

```
retry:
  [
    retry_codes: [400],
    max_retries: 3,
    ignored_middlewares: [Crawly.Middlewares.UniqueRequest]
  ]
```

### fetcher :: atom()

@@ -109,7 +132,7 @@ Allows to specify a custom HTTP client which will be performing request to the c

default: /tmp

Set spider logs directory. All spiders have their own dedicated log file
stored under the `log_dir` folder.

### port :: pos_integer()
@@ -131,6 +154,7 @@ It's possible to override most of the settings on a spider level. In order to do
it is required to define the `override_settings/0` callback in your spider.

For example:

```elixir
def override_settings() do
[
Expand All @@ -141,10 +165,12 @@ end
```
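
A representative sketch of such an override (values are illustrative):

```elixir
def override_settings() do
  [
    concurrent_requests_per_domain: 8,
    closespider_timeout: 5
  ]
end
```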

The full list of overridable settings:
- closespider_itemcount,
- closespider_timeout,
- concurrent_requests_per_domain,
- fetcher,
- retry,
- parsers,
- middlewares (has known [bugs](https://github.com/oltarasenko/crawly/issues/138)),
- pipelines
5 changes: 4 additions & 1 deletion lib/crawly/parsed_item.ex
@@ -1,6 +1,9 @@
defmodule Crawly.ParsedItem do
@moduledoc """
Defines the structure of a spider's result.

## Usage with Parsers
A `%ParsedItem{}` is piped through each parser pipeline module when it is declared. Refer to `Crawly.Pipeline` for further documentation.
"""

defstruct items: [], requests: []
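
For illustration, a spider's `parse_item/1` might build the struct like this (the URL is a placeholder, and `Crawly.Utils.request_from_url/1` is assumed to be available for request construction):

```elixir
%Crawly.ParsedItem{
  items: [%{title: "Example item"}],
  requests: [Crawly.Utils.request_from_url("https://example.com/next-page")]
}
```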

