Merge 2f3a1bd into 6b4874d
Ziinc committed Dec 25, 2019
2 parents 6b4874d + 2f3a1bd commit 837dfe6
Showing 22 changed files with 587 additions and 101 deletions.
93 changes: 78 additions & 15 deletions documentation/basic_concepts.md
@@ -77,28 +77,39 @@ The parsed item is being processed by Crawly.Worker process, which sends all req

For now only one storage backend is supported (writing to disc). In the future, Crawly will also support Amazon S3, SQL databases, and others.

## The `Crawly.Pipeline` Behaviour

Crawly uses a concept of pipelines for processing the elements sent through the system. This applies to both request and scraped-item manipulation. Conceptually, a request goes through a series of manipulations before the response is fetched. The response then goes through a separate series of manipulations.

Importantly, the way that requests and responses are manipulated is abstracted into the `Crawly.Pipeline` behaviour. This allows for a modular system for declaring changes. Note that the declared `Crawly.Pipeline` modules are applied sequentially through the `Crawly.Utils.pipe/3` function.

### Writing Tests for Custom Pipelines

Modules that implement the `Crawly.Pipeline` behaviour can make use of the `Crawly.Utils.pipe/3` function to test for expected behaviour. Refer to the function documentation for more information and examples.
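
For instance, a minimal ExUnit sketch - assuming a hypothetical `MyCustomPipeline` that stamps each item with a `:timestamp` field, and that `Crawly.Utils.pipe/3` takes the pipeline list, the item, and the state - might look like this:

```elixir
defmodule MyCustomPipelineTest do
  use ExUnit.Case

  test "stamps the scraped item with a timestamp" do
    item = %{title: "Hello"}
    state = %{spider_name: MySpider}

    # Run the item through a one-module pipeline list.
    {new_item, _new_state} = Crawly.Utils.pipe([MyCustomPipeline], item, state)

    assert Map.has_key?(new_item, :timestamp)
  end
end
```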

## Request Middlewares

These are configured under the `middlewares` option. See [configuration](./configuration.md) for more details.

> **Middleware:** A pipeline module that modifies a request. It implements the `Crawly.Pipeline` behaviour.
List of built-in middlewares:
Middlewares are able to make changes to the underlying request, a `Crawly.Request` struct. The request, along with any options specified, is then passed to the fetcher (currently `HTTPoison`).
The available configuration options should correspond to the underlying options of the fetcher in use.

Note that all request configuration options for `HTTPoison`, such as proxy, ssl, etc., can be configured through `Crawly.Request.options`.

Built-in middlewares:

1. `Crawly.Middlewares.DomainFilter` - this middleware will disable scheduling for all requests leading outside of the crawled site.
2. `Crawly.Middlewares.RobotsTxt` - this middleware ensures that Crawly respects the robots.txt defined by the target website.
3. `Crawly.Middlewares.UniqueRequest` - this middleware ensures that Crawly does not schedule the same URL (request) multiple times.
4. `Crawly.Middlewares.UserAgent` - this middleware sets the User-Agent HTTP header. It can rotate user agents if they are defined as a list.
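
For example, the middlewares above can be declared in the project config; the tuple form shown for `UserAgent` passes options to it (mirroring the defaults shown in the configuration doc):

```elixir
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    # Tuple-based declaration with options
    {Crawly.Middlewares.UserAgent, user_agents: ["My Bot"]}
  ]
```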

### Creating a Custom Request Middleware

TODO. For a worked example, see the request middleware example (adding a proxy) further down this document.

## Item Pipelines
### Item Pipelines

Crawly uses a concept of pipelines for processing the elements sent through the system. This section covers item pipelines - a tool used to pre-process items before they are stored.
> **Item Pipelines:** a pipeline module that modifies and pre-processes a scraped item.
At this point Crawly includes the following Item pipelines:
Built-in item pipelines:

1. `Crawly.Pipelines.Validate` - validates that a given item has all the required fields. All items which don't have all required fields are dropped.
2. `Crawly.Pipelines.DuplicatesFilter` - filters out items which are already stored in the system.
@@ -108,22 +119,48 @@ At this point Crawly includes the following Item pipelines:

The list of item pipelines used with a given project is defined in the project settings.

### Creating a Custom Item Pipeline
## Creating a Custom Pipeline Module

Both item pipelines and request middlewares follow the `Crawly.Pipeline` behaviour. As such, a custom pipeline needs to implement the required callback `c:Crawly.Pipeline.run/3`.

The `c:Crawly.Pipeline.run/3` callback receives the processed item, `item`, from the previous pipeline module as the first argument. The second argument, `state`, is a map containing information such as the spider from which the item originated (under the `:spider_name` key), and may optionally store pipeline information. Finally, `opts` is a keyword list containing any tuple-based options.

### Passing Configuration Options To Your Pipeline

Tuple-based option declaration is supported, similar to how a `GenServer` is declared in a supervision tree. This allows for pipeline reusability for different use cases.

For example, you can pass options in this way through your pipeline declaration:

```elixir
pipelines: [
{MyCustomPipeline, my_option: "value"}
]
```

In your pipeline, you will then receive the options passed through the `opts` argument.

An item pipeline follows the `Crawly.Pipeline` behaviour. As such, when creating your custom pipeline, it will need to implement the required callback `c:Crawly.Pipeline.run/2`.
```elixir
defmodule MyCustomPipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts) do
    # `opts` is the keyword list passed in the declaration,
    # here: [my_option: "value"]
    IO.inspect(opts)

    # Do something with the item, then pass it on together with the state
    {item, state}
  end
end
```

> **Note**: [PR #31](https://github.com/oltarasenko/crawly/pull/31) aims to allow tuple-based option declaration, similar to how a `GenServer` is declared in a supervision tree.
### Best Practices

The `c:Crawly.Pipeline.run/2` callback receives the processed item, `item`, from the previous pipeline module as the first argument. The second argument, `state`, is a map containing information such as the spider from which the item originated (under the `:spider_name` key), and may optionally store pipeline information.
The use of global configs is discouraged, hence one should pass options through a tuple-based pipeline declaration where possible.

When storing information in the `state` map, ensure that the state is namespaced with the pipeline name, so as to avoid key clashing. For example, to store state from `MyEctoPipeline`, store the state on the key `:my_ecto_pipeline_my_state`.
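
As a sketch of this convention, a hypothetical `MyCounterPipeline` could keep a running item count under its own namespaced key:

```elixir
defmodule MyCounterPipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Namespace the key with the pipeline name so it cannot clash
    # with state written by other pipelines.
    count = Map.get(state, :my_counter_pipeline_count, 0)
    {item, Map.put(state, :my_counter_pipeline_count, count + 1)}
  end
end
```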

#### Example - Ecto Storage Pipeline
### Item Pipeline Example - Ecto Storage Pipeline

```elxiir
```elixir
defmodule MyApp.MyEctoPipeline do
@impl Crawly.Pipeline
def run(item, state) do
def run(item, state, _opts \\ []) do
case MyApp.insert_with_ecto(item) do
{:ok, _} ->
# insert successful, carry on with pipeline
@@ -135,3 +172,29 @@ defmodule MyApp.MyEctoPipeline do
end
end
```

### Request Middleware Example - Add a Proxy

Following the [documentation](https://hexdocs.pm/httpoison/HTTPoison.Request.html) for proxy options of a request in `HTTPoison`, we can do the following:

```elixir
defmodule MyApp.MyProxyMiddleware do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(request, state, opts \\ []) do
    # Set default proxy and proxy_auth to nil
    opts = Enum.into(opts, %{proxy: nil, proxy_auth: nil})

    case opts.proxy do
      nil ->
        # No proxy configured, do nothing
        {request, state}

      _proxy ->
        old_options = request.options
        new_options = [proxy: opts.proxy, proxy_auth: opts.proxy_auth]
        new_request = Map.put(request, :options, old_options ++ new_options)
        {new_request, state}
    end
  end
end
```
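
The middleware could then be enabled through a tuple-based declaration in the middlewares list; the proxy URL below is only a placeholder:

```elixir
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent,
    {MyApp.MyProxyMiddleware, proxy: "http://my.proxy.example:8080"}
  ]
```
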
38 changes: 25 additions & 13 deletions documentation/configuration.md
@@ -6,10 +6,12 @@ A basic example:

```elixir
config :crawly,
# Item definition
item: [:title, :author, :time, :url],
# Identifier which is used to filter out duplicates
item_id: :title
pipelines: [
# my pipelines
],
middlewares: [
# my middlewares
]
```

## Options
@@ -32,6 +34,8 @@ by the `Crawly.Middlewares.UserAgent` middleware. When the list has more than one
item, all requests will be executed, each with a user agent string chosen
randomly from the supplied list.

> **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Middlewares.UserAgent` module documentation for correct usage.
### `item` :: [atom()]

default: []
@@ -41,6 +45,8 @@ fields are added to the following item (or if the values of
required fields are "" or nil), the item will be dropped. This setting
is used by the `Crawly.Pipelines.Validate` pipeline

> **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Pipelines.Validate` module documentation for correct usage.
### `item_id` :: atom()

default: nil
@@ -51,6 +57,8 @@ field is the SKU. This setting is used in
the `Crawly.Pipelines.DuplicatesFilter` pipeline. If unset, the related
middleware is effectively disabled.

> **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Pipelines.DuplicatesFilter` module documentation for correct usage.
### `pipelines` :: [module()]

default: []
@@ -62,21 +70,25 @@ Example configuration of item pipelines:
```elixir
config :crawly,
pipelines: [
Crawly.Pipelines.Validate,
Crawly.Pipelines.DuplicatesFilter,
{Crawly.Pipelines.Validate, fields: [:id, :date]},
{Crawly.Pipelines.DuplicatesFilter, item_id: :id},
Crawly.Pipelines.JSONEncoder,
Crawly.Pipelines.WriteToFile # NEW IN 0.6.0
{Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.6.0
]
```

### `middlewares` :: [module()]

default: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.RobotsTxt,
Crawly.Middlewares.UserAgent
]
The default middlewares are as follows:

```elixir
config :crawly,
middlewares: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.RobotsTxt,
{Crawly.Middlewares.UserAgent, user_agents: ["My Bot"] }
]
```

Defines a list of middlewares responsible for pre-processing requests. If any request from the `Crawly.Spider` does not pass a middleware, it is dropped.
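
For illustration, a middleware signals a drop by returning `false` in place of the request, as in this minimal hypothetical sketch:

```elixir
defmodule MyApp.DropAllMiddleware do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(_request, state, _opts \\ []) do
    # Returning false instead of the request drops it from scheduling.
    {false, state}
  end
end
```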

11 changes: 4 additions & 7 deletions documentation/quickstart.md
@@ -58,19 +58,16 @@ Goals:
concurrent_requests_per_domain: 8,
follow_redirects: true,
closespider_itemcount: 1000,
output_format: "csv",
item: [:title, :url],
item_id: :title,
middlewares: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.UserAgent
],
pipelines: [
Crawly.Pipelines.Validate,
Crawly.Pipelines.DuplicatesFilter,
Crawly.Pipelines.CSVEncoder,
Crawly.Pipelines.WriteToFile
{Crawly.Pipelines.Validate, fields: [:title, :url]},
{Crawly.Pipelines.DuplicatesFilter, item_id: :title },
{Crawly.Pipelines.CSVEncoder, fields: [:title, :url]},
{Crawly.Pipelines.WriteToFile, extension: "csv", folder: "/tmp" }
]
```
5. Start the Crawl:
15 changes: 13 additions & 2 deletions lib/crawly/middlewares/domain_filter.ex
@@ -1,12 +1,23 @@
defmodule Crawly.Middlewares.DomainFilter do
@moduledoc """
Filters out requests which are going outside of the crawled domain
Filters out requests which are going outside of the crawled domain.
The domain that is used to compare against the request url is obtained from the spider's `c:Crawly.Spider.base_url` callback.
Does not accept any options. Tuple-based configuration options will be ignored.
### Example Declaration
```
middlewares: [
Crawly.Middlewares.DomainFilter
]
```
"""

@behaviour Crawly.Pipeline
require Logger

def run(request, state) do
def run(request, state, _opts \\ []) do
base_url = state.spider_name.base_url()

case String.contains?(request.url, base_url) do
18 changes: 11 additions & 7 deletions lib/crawly/middlewares/robotstxt.ex
@@ -6,20 +6,24 @@ defmodule Crawly.Middlewares.RobotsTxt do
crawler can or can't request from your site. This is used mainly to avoid
overloading a site with requests!
Please NOTE:
The first rule of web crawling is you do not harm the website.
The second rule of web crawling is you do NOT harm the website.
No options are required for this middleware. Any tuple-based configuration options passed will be ignored.
### Example Declaration
```
middlewares: [
Crawly.Middlewares.RobotsTxt
]
```
"""

@behaviour Crawly.Pipeline
require Logger

def run(request, state) do
def run(request, state, _opts \\ []) do
case Gollum.crawlable?("Crawly", request.url) do
:uncrawlable ->
Logger.debug(
"Dropping request: #{request.url} (robots.txt filter)"
)
Logger.debug("Dropping request: #{request.url} (robots.txt filter)")

{false, state}

22 changes: 20 additions & 2 deletions lib/crawly/middlewares/user_agent.ex
@@ -4,16 +4,34 @@ defmodule Crawly.Middlewares.UserAgent do
:crawly, :user_agents settings.
The default value for the user agent is: Crawly Bot 1.0
Rotation is determined through `Enum.random/1`.
### Options
- `:user_agents`, optional. A list of user agent strings to rotate. Defaults to "Crawly Bot 1.0".
### Example Declaration
```
middlewares: [
{UserAgent, user_agents: ["My Custom Bot"]}
]
```
"""
require Logger

def run(request, state) do
def run(request, state, opts \\ []) do
opts = Enum.into(opts, %{user_agents: nil})

new_headers = List.keydelete(request.headers, "User-Agent", 0)
user_agents = Application.get_env(:crawly, :user_agents, ["Crawly Bot 1.0"])

user_agents =
Map.get(opts, :user_agents) ||
Application.get_env(:crawly, :user_agents, ["Crawly Bot 1.0"])

useragent = Enum.random(user_agents)

new_request =
Map.put(request, :headers, [{"User-Agent", useragent} | new_headers])

{new_request, state}
end
end
9 changes: 8 additions & 1 deletion lib/crawly/pipeline.ex
@@ -5,10 +5,17 @@ defmodule Crawly.Pipeline do
A pipeline is a module which takes a given item, and executes a
run callback on it.
A state variable is used to share common information across multiple
A state argument is used to share common information across multiple
items.
An `opts` argument is used to pass configuration to the pipeline through tuple-based declarations.
"""
@callback run(item :: map, state :: map()) ::
{new_item :: map, new_state :: map}
| {false, new_state :: map}

@callback run(item :: map, state :: map(), args :: list(any())) ::
{new_item :: map, new_state :: map}
| {false, new_state :: map}
@optional_callbacks run: 3
end
26 changes: 23 additions & 3 deletions lib/crawly/pipelines/csv_encoder.ex
@@ -1,13 +1,33 @@
defmodule Crawly.Pipelines.CSVEncoder do
@moduledoc """
Encodes a given item (map) into CSV
Encodes a given item (map) into CSV. Does not flatten nested maps.
### Options
If no fields are given, the item is dropped from the pipeline.
- `:fields`, required: The fields to extract out from the scraped item. Falls back to the global config `:item`.
### Example Usage
iex> item = %{my: "first", other: "second", ignore: "this_field"}
iex> Crawly.Pipelines.CSVEncoder.run(item, %{}, fields: [:my, :other])
{"first,second", %{}}
"""
@behaviour Crawly.Pipeline
require Logger

@impl Crawly.Pipeline
def run(item, state) do
case Application.get_env(:crawly, :item) do
@spec run(map, map, fields: list(atom)) ::
{false, state :: map} | {csv_line :: String.t(), state :: map}
def run(item, state, opts \\ []) do
opts = Enum.into(opts, %{fields: nil})
fields = Map.get(opts, :fields) || Application.get_env(:crawly, :item)

case fields do
:undefined ->
# only when neither the tuple-based nor the global config is provided

Logger.info(
"Dropping item: #{inspect(item)}. Reason: No fields declared for CSVEncoder"
)

{false, state}

fields ->
