Merge 2f3a1bd into 6b4874d
Ziinc committed Dec 25, 2019
2 parents 6b4874d + 2f3a1bd commit 837dfe6
Showing 22 changed files with 587 additions and 101 deletions.
93 changes: 78 additions & 15 deletions documentation/basic_concepts.md
@@ -77,28 +77,39 @@ The parsed item is being processed by Crawly.Worker process, which sends all req

For now only one storage backend is supported (writing to disc). In the future, Crawly will also support Amazon S3, SQL databases, and others.

## The `Crawly.Pipeline` Behaviour

Crawly uses a concept of pipelines for processing the elements sent through the system. This applies to both request and scraped-item manipulation. Conceptually, a request goes through a series of manipulations before the response is fetched. The response then goes through a separate series of manipulations.

Importantly, the way that requests and responses are manipulated is abstracted into the `Crawly.Pipeline` behaviour. This allows for a modular system for declaring changes. Note that the declared `Crawly.Pipeline` modules are applied sequentially through the `Crawly.Utils.pipe/3` function.

### Writing Tests for Custom Pipelines

Modules that implement the `Crawly.Pipeline` behaviour can make use of the `Crawly.Utils.pipe/3` function to test for expected behaviour. Refer to the function documentation for more information and examples.
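
For instance, a minimal ExUnit sketch - assuming a hypothetical `MyCustomPipeline` that stamps each item with a `:timestamp` field, and that `Crawly.Utils.pipe/3` takes the pipeline list, the item, and the state - might look like this:

```elixir
defmodule MyCustomPipelineTest do
  use ExUnit.Case

  test "stamps the scraped item with a timestamp" do
    item = %{title: "Hello"}
    state = %{spider_name: MySpider}

    # Run the item through a one-module pipeline list.
    {new_item, _new_state} = Crawly.Utils.pipe([MyCustomPipeline], item, state)

    assert Map.has_key?(new_item, :timestamp)
  end
end
```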

## Request Middlewares

These are configured under the `middlewares` option. See [configuration](./configuration.md) for more details.

> **Middleware:** A pipeline module that modifies a request. It implements the `Crawly.Pipeline` behaviour.
List of built-in middlewares:
Middlewares are able to make changes to the underlying request, a `Crawly.Request` struct. The request, along with any options specified, is then passed to the fetcher (currently `HTTPoison`).
The available configuration options should correspond to the underlying options of the fetcher in use.

Note that all request configuration options for `HTTPoison`, such as proxy, ssl, etc., can be configured through `Crawly.Request.options`.

Built-in middlewares:

1. `Crawly.Middlewares.DomainFilter` - this middleware will disable scheduling for all requests leading outside of the crawled site.
2. `Crawly.Middlewares.RobotsTxt` - this middleware ensures that Crawly respects the robots.txt defined by the target website.
3. `Crawly.Middlewares.UniqueRequest` - this middleware ensures that Crawly does not schedule the same URL (request) multiple times.
4. `Crawly.Middlewares.UserAgent` - this middleware sets the User-Agent HTTP header. It can rotate user agents if they are defined as a list.
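
For example, the middlewares above can be declared in the project config; the tuple form shown for `UserAgent` passes options to it (mirroring the defaults shown in the configuration doc):

```elixir
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    # Tuple-based declaration with options
    {Crawly.Middlewares.UserAgent, user_agents: ["My Bot"]}
  ]
```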

### Creating a Custom Request Middleware

TODO. For a worked example, see the request middleware example (adding a proxy) further down this document.

## Item Pipelines
### Item Pipelines

Crawly uses a concept of pipelines for processing the elements sent through the system. This section covers item pipelines - a tool used to pre-process items before they are stored.
> **Item Pipelines:** a pipeline module that modifies and pre-processes a scraped item.
At this point Crawly includes the following Item pipelines:
Built-in item pipelines:

1. `Crawly.Pipelines.Validate` - validates that a given item has all the required fields. All items which don't have all required fields are dropped.
2. `Crawly.Pipelines.DuplicatesFilter` - filters out items which are already stored in the system.
@@ -108,22 +119,48 @@ At this point Crawly includes the following Item pipelines:

The list of item pipelines used with a given project is defined in the project settings.

### Creating a Custom Item Pipeline
## Creating a Custom Pipeline Module

Both item pipelines and request middlewares follow the `Crawly.Pipeline` behaviour. As such, a custom pipeline needs to implement the required callback `c:Crawly.Pipeline.run/3`.

The `c:Crawly.Pipeline.run/3` callback receives the processed item, `item`, from the previous pipeline module as the first argument. The second argument, `state`, is a map containing information such as the spider from which the item originated (under the `:spider_name` key), and may optionally store pipeline information. Finally, `opts` is a keyword list containing any tuple-based options.

### Passing Configuration Options To Your Pipeline

Tuple-based option declaration is supported, similar to how a `GenServer` is declared in a supervision tree. This allows for pipeline reusability for different use cases.

For example, you can pass options in this way through your pipeline declaration:

```elixir
pipelines: [
{MyCustomPipeline, my_option: "value"}
]
```

In your pipeline, you will then receive the options passed through the `opts` argument.

An item pipeline follows the `Crawly.Pipeline` behaviour. As such, when creating your custom pipeline, it will need to implement the required callback `c:Crawly.Pipeline.run/2`.
```elixir
defmodule MyCustomPipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, opts) do
    # `opts` is the keyword list passed in the declaration,
    # here: [my_option: "value"]
    IO.inspect(opts)

    # Do something with the item, then pass it on together with the state
    {item, state}
  end
end
```

> **Note**: [PR #31](https://github.com/oltarasenko/crawly/pull/31) aims to allow tuple-based option declaration, similar to how a `GenServer` is declared in a supervision tree.
### Best Practices

The `c:Crawly.Pipeline.run/2` callback receives the processed item, `item`, from the previous pipeline module as the first argument. The second argument, `state`, is a map containing information such as the spider from which the item originated (under the `:spider_name` key), and may optionally store pipeline information.
The use of global configs is discouraged, hence one should pass options through a tuple-based pipeline declaration where possible.

When storing information in the `state` map, ensure that the state is namespaced with the pipeline name, so as to avoid key clashing. For example, to store state from `MyEctoPipeline`, store the state on the key `:my_ecto_pipeline_my_state`.
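
As a sketch of this convention, a hypothetical `MyCounterPipeline` could keep a running item count under its own namespaced key:

```elixir
defmodule MyCounterPipeline do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(item, state, _opts \\ []) do
    # Namespace the key with the pipeline name so it cannot clash
    # with state written by other pipelines.
    count = Map.get(state, :my_counter_pipeline_count, 0)
    {item, Map.put(state, :my_counter_pipeline_count, count + 1)}
  end
end
```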

#### Example - Ecto Storage Pipeline
### Item Pipeline Example - Ecto Storage Pipeline

```elxiir
```elixir
defmodule MyApp.MyEctoPipeline do
@impl Crawly.Pipeline
def run(item, state) do
def run(item, state, _opts \\ []) do
case MyApp.insert_with_ecto(item) do
{:ok, _} ->
# insert successful, carry on with pipeline
@@ -135,3 +172,29 @@ defmodule MyApp.MyEctoPipeline do
end
end
```

### Request Middleware Example - Add a Proxy

Following the [documentation](https://hexdocs.pm/httpoison/HTTPoison.Request.html) for proxy options of a request in `HTTPoison`, we can do the following:

```elixir
defmodule MyApp.MyProxyMiddleware do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(request, state, opts \\ []) do
    # Set default proxy and proxy_auth to nil
    opts = Enum.into(opts, %{proxy: nil, proxy_auth: nil})

    case opts.proxy do
      nil ->
        # No proxy configured, do nothing
        {request, state}

      _proxy ->
        old_options = request.options
        new_options = [proxy: opts.proxy, proxy_auth: opts.proxy_auth]
        new_request = Map.put(request, :options, old_options ++ new_options)
        {new_request, state}
    end
  end
end
```
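
The middleware could then be enabled through a tuple-based declaration in the middlewares list; the proxy URL below is only a placeholder:

```elixir
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent,
    {MyApp.MyProxyMiddleware, proxy: "http://my.proxy.example:8080"}
  ]
```
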
38 changes: 25 additions & 13 deletions documentation/configuration.md
@@ -6,10 +6,12 @@ A basic example:

```elixir
config :crawly,
# Item definition
item: [:title, :author, :time, :url],
# Identifier which is used to filter out duplicates
item_id: :title
pipelines: [
# my pipelines
],
middlewares: [
# my middlewares
]
```

## Options
@@ -32,6 +34,8 @@ by the `Crawly.Middlewares.UserAgent` middleware. When the list has more than one
item, all requests will be executed, each with a user agent string chosen
randomly from the supplied list.

> **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Middlewares.UserAgent` module documentation for correct usage.
### `item` :: [atom()]

default: []
@@ -41,6 +45,8 @@ fields are added to the following item (or if the values of
required fields are "" or nil), the item will be dropped. This setting
is used by the `Crawly.Pipelines.Validate` pipeline

> **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Pipelines.Validate` module documentation for correct usage.
### `item_id` :: atom()

default: nil
@@ -51,6 +57,8 @@ field is the SKU. This setting is used in
the `Crawly.Pipelines.DuplicatesFilter` pipeline. If unset, the related
middleware is effectively disabled.

> **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Pipelines.DuplicatesFilter` module documentation for correct usage.
### `pipelines` :: [module()]

default: []
@@ -62,21 +70,25 @@ Example configuration of item pipelines:
```elixir
config :crawly,
pipelines: [
Crawly.Pipelines.Validate,
Crawly.Pipelines.DuplicatesFilter,
{Crawly.Pipelines.Validate, fields: [:id, :date]},
{Crawly.Pipelines.DuplicatesFilter, item_id: :id},
Crawly.Pipelines.JSONEncoder,
Crawly.Pipelines.WriteToFile # NEW IN 0.6.0
{Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.6.0
]
```

### `middlewares` :: [module()]

default: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.RobotsTxt,
Crawly.Middlewares.UserAgent
]
The default middlewares are as follows:

```elixir
config :crawly,
middlewares: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.RobotsTxt,
{Crawly.Middlewares.UserAgent, user_agents: ["My Bot"] }
]
```

Defines a list of middlewares responsible for pre-processing requests. If any request from the `Crawly.Spider` does not pass a middleware, it is dropped.
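
For illustration, a middleware signals a drop by returning `false` in place of the request, as in this minimal hypothetical sketch:

```elixir
defmodule MyApp.DropAllMiddleware do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(_request, state, _opts \\ []) do
    # Returning false instead of the request drops it from scheduling.
    {false, state}
  end
end
```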

11 changes: 4 additions & 7 deletions documentation/quickstart.md
@@ -58,19 +58,16 @@ Goals:
concurrent_requests_per_domain: 8,
follow_redirects: true,
closespider_itemcount: 1000,
output_format: "csv",
item: [:title, :url],
item_id: :title,
middlewares: [
Crawly.Middlewares.DomainFilter,
Crawly.Middlewares.UniqueRequest,
Crawly.Middlewares.UserAgent
],
pipelines: [
Crawly.Pipelines.Validate,
Crawly.Pipelines.DuplicatesFilter,
Crawly.Pipelines.CSVEncoder,
Crawly.Pipelines.WriteToFile
{Crawly.Pipelines.Validate, fields: [:title, :url]},
{Crawly.Pipelines.DuplicatesFilter, item_id: :title },
{Crawly.Pipelines.CSVEncoder, fields: [:title, :url]},
{Crawly.Pipelines.WriteToFile, extension: "csv", folder: "/tmp" }
]
```
5. Start the Crawl:
15 changes: 13 additions & 2 deletions lib/crawly/middlewares/domain_filter.ex
@@ -1,12 +1,23 @@
defmodule Crawly.Middlewares.DomainFilter do
@moduledoc """
Filters out requests which are going outside of the crawled domain
Filters out requests which are going outside of the crawled domain.
The domain that is used to compare against the request url is obtained from the spider's `c:Crawly.Spider.base_url` callback.
Does not accept any options. Tuple-based configuration options will be ignored.
### Example Declaration
```
middlewares: [
Crawly.Middlewares.DomainFilter
]
```
"""

@behaviour Crawly.Pipeline
require Logger

def run(request, state) do
def run(request, state, _opts \\ []) do
base_url = state.spider_name.base_url()

case String.contains?(request.url, base_url) do
18 changes: 11 additions & 7 deletions lib/crawly/middlewares/robotstxt.ex
@@ -6,20 +6,24 @@ defmodule Crawly.Middlewares.RobotsTxt do
crawler can or can't request from your site. This is used mainly to avoid
overloading a site with requests!
Please NOTE:
The first rule of web crawling is you do not harm the website.
The second rule of web crawling is you do NOT harm the website.
No options are required for this middleware. Any tuple-based configuration options passed will be ignored.
### Example Declaration
```
middlewares: [
Crawly.Middlewares.RobotsTxt
]
```
"""

@behaviour Crawly.Pipeline
require Logger

def run(request, state) do
def run(request, state, _opts \\ []) do
case Gollum.crawlable?("Crawly", request.url) do
:uncrawlable ->
Logger.debug(
"Dropping request: #{request.url} (robots.txt filter)"
)
Logger.debug("Dropping request: #{request.url} (robots.txt filter)")

{false, state}

22 changes: 20 additions & 2 deletions lib/crawly/middlewares/user_agent.ex
@@ -4,16 +4,34 @@ defmodule Crawly.Middlewares.UserAgent do
:crawly, :user_agents settings.
The default value for the user agent is: Crawly Bot 1.0
Rotation is determined through `Enum.random/1`.
### Options
- `:user_agents`, optional. A list of user agent strings to rotate. Defaults to "Crawly Bot 1.0".
### Example Declaration
```
middlewares: [
{UserAgent, user_agents: ["My Custom Bot"]}
]
```
"""
require Logger

def run(request, state) do
def run(request, state, opts \\ []) do
opts = Enum.into(opts, %{user_agents: nil})

new_headers = List.keydelete(request.headers, "User-Agent", 0)
user_agents = Application.get_env(:crawly, :user_agents, ["Crawly Bot 1.0"])

user_agents =
Map.get(opts, :user_agents) ||
Application.get_env(:crawly, :user_agents, ["Crawly Bot 1.0"])

useragent = Enum.random(user_agents)

new_request =
Map.put(request, :headers, [{"User-Agent", useragent} | new_headers])

{new_request, state}
end
end
9 changes: 8 additions & 1 deletion lib/crawly/pipeline.ex
@@ -5,10 +5,17 @@ defmodule Crawly.Pipeline do
A pipeline is a module which takes a given item, and executes a
run callback on it.
A state variable is used to share common information across multiple
A state argument is used to share common information across multiple
items.
An `opts` argument is used to pass configuration to the pipeline through tuple-based declarations.
"""
@callback run(item :: map, state :: map()) ::
{new_item :: map, new_state :: map}
| {false, new_state :: map}

@callback run(item :: map, state :: map(), args :: list(any())) ::
{new_item :: map, new_state :: map}
| {false, new_state :: map}
@optional_callbacks run: 3
end
26 changes: 23 additions & 3 deletions lib/crawly/pipelines/csv_encoder.ex
@@ -1,13 +1,33 @@
defmodule Crawly.Pipelines.CSVEncoder do
@moduledoc """
Encodes a given item (map) into CSV
Encodes a given item (map) into CSV. Does not flatten nested maps.
### Options
If no fields are given, the item is dropped from the pipeline.
- `:fields`, required: The fields to extract out from the scraped item. Falls back to the global config `:item`.
### Example Usage
iex> item = %{my: "first", other: "second", ignore: "this_field"}
iex> Crawly.Pipelines.CSVEncoder.run(item, %{}, fields: [:my, :other])
{"first,second", %{}}
"""
@behaviour Crawly.Pipeline
require Logger

@impl Crawly.Pipeline
def run(item, state) do
case Application.get_env(:crawly, :item) do
@spec run(map, map, fields: list(atom)) ::
{false, state :: map} | {csv_line :: String.t(), state :: map}
def run(item, state, opts \\ []) do
opts = Enum.into(opts, %{fields: nil})
fields = Map.get(opts, :fields) || Application.get_env(:crawly, :item)

case fields do
:undefined ->
# only when neither the tuple-based nor the global config is provided

Logger.info(
"Dropping item: #{inspect(item)}. Reason: No fields declared for CSVEncoder"
)

{false, state}

fields ->
