Parse pipelines (#150)
* added Parse struct

* implemented parsers

* refactored code for code quality

* Removed restrictive Parse struct

* added docs
Ziinc committed Jan 26, 2021
1 parent 648c406 commit 1ec03d7
Showing 7 changed files with 270 additions and 85 deletions.
85 changes: 52 additions & 33 deletions documentation/basic_concepts.md
@@ -1,7 +1,9 @@
# Basic Concepts

---

## Flow from Request to Response to Parsed Item

Data is fetched in a linear series of operations.

1. New `Request`s are formed through `Crawly.Spider.init/0`.
@@ -10,7 +12,6 @@ Data is fetched in a linear series of operations.
4. The `Spider` receives and parses the response, returning new `Request`s and new parsed items.
5. Parsed items are post-processed individually, while new `Request`s from the `Spider` go back to step 2.


## Spiders

Spiders are modules which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items). In other words, Spiders are the place where you define the custom behaviour for crawling and parsing pages for a particular site.
@@ -34,14 +35,14 @@ All items are processed sequentially and are processed by Item pipelines.
In order to make a working web crawler, all the behaviour callbacks need to be implemented.
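
A minimal spider sketch tying these callbacks together might look as follows (the module name, site, and selectors are illustrative, Floki is assumed to be available for HTML parsing, and `Crawly.Utils.request_from_url/1` is assumed to be available for building requests):

```elixir
defmodule MyApp.BooksSpider do
  use Crawly.Spider

  def base_url(), do: "https://books.toscrape.com"

  def init(), do: [start_urls: ["https://books.toscrape.com/"]]

  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)

    # Scraped items: one map per book title on the page.
    items =
      document
      |> Floki.find("article.product_pod h3 a")
      |> Floki.attribute("title")
      |> Enum.map(fn title -> %{title: title} end)

    # Follow-up requests: the "next page" link, joined naively for this sketch.
    requests =
      document
      |> Floki.find("li.next a")
      |> Floki.attribute("href")
      |> Enum.map(fn href ->
        Crawly.Utils.request_from_url("https://books.toscrape.com/catalogue/" <> href)
      end)

    %Crawly.ParsedItem{items: items, requests: requests}
  end
end
```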

`init()` - a part of the Crawly.Spider behaviour. This function should return a KVList which contains a `start_urls` entry with a list, which defines the starting requests made by Crawly. Alternatively you may provide `start_requests` if it's required
to prepare the first requests on `init()`, which might be useful if, for example, you
want to pass a session cookie to the starting request. Note: `start_requests` are
processed before `start_urls`.
\*\* This callback is going to be deprecated in favour of `init/1`. For now, backwards
compatibility is kept with the help of a macro which always generates `init/1`.

`init(options)` - same as `init/0` but also takes options (which can be passed from the engine during
the spider start).

`base_url()` - defines a base_url of the given Spider. This function is used in order to filter out all requests which are going outside of the crawled website.

@@ -129,9 +130,15 @@ Built-in middlewares:
Crawly.Middlewares.AutoCookiesManager
```

### Response Parsers

> **Response Parsers:** a pipeline module that parses a fetcher's response. If declared, a spider's `c:Crawly.Spider.parse_item/1` callback is ignored. Parsers are unused by default. A parser implements the `Crawly.Pipeline` behaviour.

Parsers allow for logic reuse when spiders parse a fetcher's response.
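
As a hedged sketch — assuming that a parser's `run/3` receives the `%Crawly.ParsedItem{}` accumulator together with a state map carrying the fetcher's response — a custom parser might look like this (the module name and selector are illustrative, and `Crawly.Utils.request_from_url/1` is assumed to be available):

```elixir
defmodule MyApp.Parsers.ExtractProductRequests do
  @behaviour Crawly.Pipeline

  # Assumption: the fetcher's response is available on the state map as :response.
  @impl Crawly.Pipeline
  def run(parsed_item, %{response: response} = state, opts \\ []) do
    selector = Keyword.get(opts, :selector, "a.product")

    {:ok, document} = Floki.parse_document(response.body)

    # Collect matching links and append them as new requests on the accumulator.
    requests =
      document
      |> Floki.find(selector)
      |> Floki.attribute("href")
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    {%{parsed_item | requests: parsed_item.requests ++ requests}, state}
  end
end
```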

### Item Pipelines

> **Item Pipelines:** a pipeline module that modifies and pre-processes a scraped item. It implements the `Crawly.Pipeline` behaviour.

Built-in item pipelines:

Expand Down Expand Up @@ -173,14 +180,14 @@ defmodule MyCustomPipeline do
end
```


### Best Practices

The use of global configs is discouraged, hence options should be passed through a tuple-based pipeline declaration where possible.

When storing information in the `state` map, ensure that the state is namespaced with the pipeline name, so as to avoid key clashing. For example, to store state from `MyEctoPipeline`, store the state on the key `:my_ecto_pipeline_my_state`.
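
A short sketch of both practices together (the module, option name, state key, and `MyApp.Repo` are illustrative):

```elixir
defmodule MyApp.MyEctoPipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts \\ []) do
    # Options come from the tuple-based declaration, e.g.
    # {MyApp.MyEctoPipeline, table: "scraped_items"}, rather than a global config.
    table = Keyword.get(opts, :table, "scraped_items")

    # State is namespaced with the pipeline name to avoid key clashes.
    inserted = Map.get(state, :my_ecto_pipeline_inserted_count, 0)

    MyApp.Repo.insert_all(table, [item])

    {item, Map.put(state, :my_ecto_pipeline_inserted_count, inserted + 1)}
  end
end
```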

### Custom Request Middlewares

#### Request Middleware Example - Add a Proxy

Following the [documentation](https://hexdocs.pm/httpoison/HTTPoison.Request.html) for proxy options of a request in `HTTPoison`, we can do the following:
@@ -206,14 +213,16 @@ defmodule MyApp.MyProxyMiddleware do
end
```
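
A hedged sketch of such a middleware — assuming the `options` field on `Crawly.Request` is passed through to HTTPoison, and using a placeholder proxy address — could be:

```elixir
defmodule MyApp.MyProxyMiddleware do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(request, state, opts \\ []) do
    # HTTPoison accepts a :proxy entry in its request options.
    proxy = Keyword.get(opts, :proxy, "my.proxy.example:3128")

    new_options = Keyword.put(request.options, :proxy, proxy)
    {%{request | options: new_options}, state}
  end
end
```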


### Custom Item Pipelines

Item pipelines receive the parsed item (from the Spider) and perform post-processing on the item.

#### Storing Parsed Items

You can use custom item pipelines to save the item to custom storages.

##### Example - Ecto Storage Pipeline

In this example, we insert the scraped item into a table with Ecto. This example does not directly call `MyRepo.insert`, but delegates it to an application context function.

```elixir
@@ -233,50 +242,58 @@ end
```
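
A minimal sketch of the described approach (`MyApp.create_scraped_item/1` is a hypothetical application context function):

```elixir
defmodule MyApp.EctoStoragePipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Delegate persistence to a context function instead of calling
    # MyRepo.insert/1 directly from the pipeline.
    case MyApp.create_scraped_item(item) do
      {:ok, _record} ->
        {item, state}

      {:error, _changeset} ->
        # Returning false for the item drops it from further processing.
        {false, state}
    end
  end
end
```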

#### Multiple Different Types of Parsed Items

If you need to selectively post-process different types of scraped items, you can utilize pattern-matching at the item pipeline level.

There are two general methods of doing so:

1. Struct-based pattern matching

   ```elixir
   defmodule MyApp.MyCustomPipeline do
     @impl Crawly.Pipeline
     def run(%MyItem{} = item, state, _opts \\ []) do
       # do something
     end

     # do nothing if it does not match
     def run(item, state, _opts), do: {item, state}
   end
   ```

2. Key-based pattern matching

   ```elixir
   defmodule MyApp.MyCustomPipeline do
     @impl Crawly.Pipeline
     def run(%{my_item: my_item} = item, state, _opts \\ []) do
       # do something
     end

     # do nothing if it does not match
     def run(item, state, _opts), do: {item, state}
   end
   ```

Use struct-based pattern matching when:

1. you want to utilize existing Ecto schemas
2. you have pre-defined structs that you want to conform to

Use key-based pattern matching when:

1. you want to process two or more related and inter-dependent items together
2. you want to bulk process multiple items for efficiency reasons. For example, processing the weather data for 365 days in one pass.

##### Caveats

When using the key-based pattern matching method, the spider's `Crawly.Spider.parse_item/1` callback will need to return items with a single key (or a map with multiple keys, if doing related processing).

When using struct-based pattern matching with existing Ecto structs, you will need to do an intermediate conversion of the struct into a map before performing the insertion into the Ecto Repo. This is due to the underlying Ecto schema metadata still being attached to the struct before insertion.
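
For instance, a small helper along these lines (field names are illustrative) can strip the schema metadata before insertion:

```elixir
defmodule MyApp.ItemSanitizer do
  @doc """
  Converts an Ecto schema struct into a plain map suitable for re-insertion,
  dropping the metadata and timestamps Ecto attaches to loaded structs.
  """
  def to_plain_map(struct) do
    struct
    |> Map.from_struct()
    |> Map.drop([:__meta__, :id, :inserted_at, :updated_at])
  end
end
```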

##### Example - Multi-Item Pipelines With Pattern Matching

In this example, your spider scrapes a "blog post" and "weather data" from a website.
We will use the key-based pattern matching approach to selectively post-process a blog post parsed item.


```elixir
# in MyApp.CustomSpider.ex
def parse_item(response) do
@@ -286,8 +303,10 @@ def parse_item(response):
%{weather: [ january_weather, february_weather ]}
]}
```

Then, in the custom pipeline, we will pattern match on the `:blog_post` key, to ensure that we only process blog posts with this pipeline (and not weather data).
We then update the `:blog_post` key of the received item.

```elixir
defmodule MyApp.BlogPostPipeline do
@impl Crawly.Pipeline
@@ -328,7 +347,7 @@ See: https://splash.readthedocs.io/en/stable/api.html
You can try using Splash with Crawly in the following way:

1. Start splash locally (e.g. using a docker image):
   `docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 300`
2. Configure Crawly to use Splash:
   `fetcher: {Crawly.Fetchers.Splash, [base_url: "http://localhost:8050/render.html"]}`
3. Now all your pages will be automatically rendered by Splash.
106 changes: 66 additions & 40 deletions documentation/configuration.md
@@ -16,15 +16,53 @@ config :crawly,

## Options

### middlewares :: [module()]

Defines a list of middlewares responsible for pre-processing requests. Any request from the `Crawly.Spider` that does not pass all of the middlewares declared on the `middlewares` key is dropped.

Refer to `Crawly.Pipeline` for more information on the structure of a middleware.

```elixir
# Example middlewares
config :crawly,
middlewares: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.RobotsTxt,
# With options
{Crawly.Middlewares.UserAgent, user_agents: ["My Bot"] },
{Crawly.Middlewares.RequestOptions, [timeout: 30_000, recv_timeout: 15000]}
]
```

### `parsers` :: [module()]

default: nil

By default, parsers are unused, and a `Crawly.Spider`'s `parse_item/1` callback will be called to parse a fetcher's response. However, it is possible to utilize Crawly's built-in parsers or your own custom logic to parse responses from a fetcher.

> **IMPORTANT**: If set at the global level, **ALL** spiders will not have their `parse_item/1` callback used. It is advised to declare parsers at the non-global level using spider-level settings overrides.

Each parser may have additional peer dependencies if used. Refer to the documentation for each parser to know its specific requirements.

Refer to `Crawly.Pipeline` for more information on the structure of a parser.

```elixir
config :crawly,
parsers: [
{Crawly.Parsers.ExtractRequests, selector: "button"},
]
```
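
For example, a spider-level declaration (a sketch; the spider module, site, and selector are illustrative) keeps the `parse_item/1` callbacks of all other spiders intact:

```elixir
defmodule MyApp.ProductsSpider do
  use Crawly.Spider

  def base_url(), do: "https://example.com"

  def init(), do: [start_urls: ["https://example.com/products"]]

  # Still required by the behaviour, but ignored for this spider while
  # parsers are declared.
  def parse_item(_response), do: %Crawly.ParsedItem{items: [], requests: []}

  def override_settings() do
    [parsers: [{Crawly.Parsers.ExtractRequests, selector: "a.product"}]]
  end
end
```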

### `pipelines` :: [module()]

default: []

Defines a list of pipelines responsible for pre-processing all the scraped items. All items not passing any of the pipelines are dropped. If unset, all items are stored without any modifications.

Refer to `Crawly.Pipeline` for more information on the structure of a pipeline.

Example configuration of item pipelines:

```elixir
config :crawly,
pipelines: [
{Crawly.Pipelines.Validate, fields: [:id, :date]},
@@ -34,22 +72,6 @@ config :crawly,
]
```


### closespider_itemcount :: pos_integer() | :disabled

default: :disabled
@@ -62,7 +84,6 @@ default: nil (disabled by default)

Defines the minimal number of items which need to be scraped by the spider within the given timeframe (1 minute). If the limit is not reached by the spider, it will be stopped.
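
For example (values are illustrative):

```elixir
config :crawly,
  closespider_itemcount: 5000,
  # stop the spider if fewer than 10 items were scraped during the last minute
  closespider_timeout: 10
```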


### concurrent_requests_per_domain :: pos_integer()

default: 4
@@ -78,26 +99,28 @@ NOTE: A worker's speed is often limited by the speed of the actual HTTP client a
### retry :: Keyword list

Allows to configure the retry logic. Accepts the following configuration options:
1. _retry_codes_: Allows to specify a list of HTTP codes which are treated as
   failed responses. (Default: [])
2. _max_retries_: Allows to specify the number of attempts before the request is
   abandoned. (Default: 0)
3. _ignored_middlewares_: Allows to modify the list of processors for a given
   request when a retry happens. (Will be required to avoid clashes with
   Unique.Request middleware).

Example:

```
retry:
  [
    retry_codes: [400],
    max_retries: 3,
    ignored_middlewares: [Crawly.Middlewares.UniqueRequest]
  ]
```

### fetcher :: atom()

@@ -109,7 +132,7 @@ Allows to specify a custom HTTP client which will be performing request to the c

default: /tmp

Set spider logs directory. All spiders have their own dedicated log file
stored under the `log_dir` folder.

### port :: pos_integer()
@@ -131,6 +154,7 @@ It's possible to override most of the settings on a spider level. In order to do
it is required to define the `override_settings/0` callback in your spider.

For example:

```elixir
def override_settings() do
[
Expand All @@ -141,10 +165,12 @@ end
```
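
A representative sketch of such an override (values are illustrative):

```elixir
def override_settings() do
  [
    concurrent_requests_per_domain: 8,
    closespider_timeout: 5
  ]
end
```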

The full list of overridable settings:
- closespider_itemcount,
- closespider_timeout,
- concurrent_requests_per_domain,
- fetcher,
- retry,
- parsers,
- middlewares (has known [bugs](https://github.com/oltarasenko/crawly/issues/138)),
- pipelines
5 changes: 4 additions & 1 deletion lib/crawly/parsed_item.ex
@@ -1,6 +1,9 @@
defmodule Crawly.ParsedItem do
@moduledoc """
Defines the structure of a spider's result.

## Usage with Parsers
A `%ParsedItem{}` is piped through each parser pipeline module when it is declared. Refer to `Crawly.Pipeline` for further documentation.
"""

defstruct items: [], requests: []
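
For illustration, a spider's `parse_item/1` might build the struct like this (the URL is a placeholder, and `Crawly.Utils.request_from_url/1` is assumed to be available for request construction):

```elixir
%Crawly.ParsedItem{
  items: [%{title: "Example item"}],
  requests: [Crawly.Utils.request_from_url("https://example.com/next-page")]
}
```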

