Commit 905b3a5: Address codereview comments

oltarasenko committed Apr 1, 2020
1 parent 207412a commit 905b3a5
Showing 6 changed files with 16 additions and 66 deletions.
2 changes: 1 addition & 1 deletion config/config.exs
@@ -60,7 +60,7 @@ config :crawly,
],
pipelines: [
{Crawly.Pipelines.Validate, fields: [:title, :author, :time, :url]},
- {Crawly.Pipelines.DuplicatesFilter, item_id: :titile},
+ {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
Crawly.Pipelines.JSONEncoder
]

48 changes: 0 additions & 48 deletions documentation/configuration.md
@@ -16,49 +16,6 @@ config :crawly,

## Options

- ### base_store_path :: binary() [DEPRECATED in 0.6.0]
-
- default: "/tmp"
-
- Defines the path where items are stored in the filesystem. This setting
- is used by the Crawly.DataStorageWorker process.
-
- > **Deprecated**: This has been deprecated in favour of having pipelines to handle data storage, as of `0.6.0`
- ### `user_agents` :: list() [DEPRECATED in 0.9.0, use middleware based configuration]
-
- default: ["Crawly Bot 1.0"]
-
- Defines a user agent string for Crawly requests. This setting is used
- by the `Crawly.Middlewares.UserAgent` middleware. When the list has more than one
- item, all requests will be executed, each with a user agent string chosen
- randomly from the supplied list.
-
- > **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Middlewares.UserAgent` module documentation for correct usage.
- ### `item` :: [atom()]
-
- default: []
-
- Defines a list of required fields for the item. When none of the default
- fields are added to the following item (or if the values of
- required fields are "" or nil), the item will be dropped. This setting
- is used by the `Crawly.Pipelines.Validate` pipeline
-
- > **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Pipelines.Validate` module documentation for correct usage.
- ### `item_id` :: atom()
-
- default: nil
-
- Defines a field which will be used in order to identify if an item is
- a duplicate or not. In most of the ecommerce websites the desired id
- field is the SKU. This setting is used in
- the `Crawly.Pipelines.DuplicatesFilter` pipeline. If unset, the related
- middleware is effectively disabled.
-
- > **Deprecated**: This has been deprecated in favour of tuple-based pipeline configuration instead of global configurations, as of `0.7.0`. Refer to `Crawly.Pipelines.DuplicatesFilter` module documentation for correct usage.
### `pipelines` :: [module()]

default: []
@@ -105,11 +62,6 @@ default: nil

Defines a minimal amount of items which needs to be scraped by the spider within the given timeframe (30s). If the limit is not reached by the spider - it will be stopped.

- ### follow_redirect :: boolean() [Deprecated, use fetcher settings instead]
-
- default: false
-
- Defines is Crawly spider is supposed to follow HTTP redirects or not.

### concurrent_requests_per_domain :: pos_integer()

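All of the deleted options direct readers to the per-module tuple configuration that replaced them. As a rough sketch of the equivalent modern setup (the `user_agents:` and `item_id:` option keys follow the module docs referenced above; the values are illustrative, not this project's actual config):

```elixir
# config/config.exs -- a sketch only, assuming Elixir 1.9+ config style.
import Config

config :crawly,
  middlewares: [
    # replaces the deprecated global `user_agents` option
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly Bot 1.0"]}
  ],
  pipelines: [
    # replaces the deprecated global `item` (required fields) option
    {Crawly.Pipelines.Validate, fields: [:title, :author, :time, :url]},
    # replaces the deprecated global `item_id` option
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder
  ]
```
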
18 changes: 9 additions & 9 deletions lib/crawly/settings.ex
@@ -25,24 +25,24 @@ defmodule Crawly.Settings do
| {Crawly.Pipelines.WriteToFile,
folder: binary(), extension: binary()}

- @type t() :: %{
+ @type t() :: [
# Allows to stop spider after a given number of scraped items
# :disabled by default.
- optional(:closespider_itemcount) => numeric_setting(),
+ closespider_itemcount: numeric_setting(),

# Allows to stop spider if it extracts less than a given amount of
# items per minute.
- optional(:closespider_timeout) => pos_integer(),
+ closespider_timeout: pos_integer(),

# Allows to control how many workers are started for a given domain
- optional(:concurrent_requests_per_domain) => pos_integer(),
+ concurrent_requests_per_domain: pos_integer(),

# Allows to define a fetcher to perform HTTP requests
- optional(:fetcher) => Crawly.Fetchers.Fetcher.t(),
+ fetcher: Crawly.Fetchers.Fetcher.t(),

# Defines retries
- optional(:retry) => retry(),
- optional(:middlewares) => [middleware()],
- optional(:pipelines) => [pipeline()]
- }
+ retry: retry(),
+ middlewares: [middleware()],
+ pipelines: [pipeline()]
+ ]
end
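
The type change swaps a map with `optional/1` keys for a keyword list; every key remains optional, since any subset of keys forms a valid keyword list. A minimal sketch of a value satisfying the new `Crawly.Settings.t()`:

```elixir
# Any subset of the typed keys is a valid settings value.
settings = [
  concurrent_requests_per_domain: 5,
  closespider_timeout: 10,
  pipelines: [Crawly.Pipelines.JSONEncoder]
]
```
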
6 changes: 3 additions & 3 deletions lib/crawly/spider.ex
@@ -10,7 +10,7 @@ defmodule Crawly.Spider do
request and converting it into items which can be stored and new requests
which can be scheduled
4. `custom_settings/0` an optional callback which can be used in order to
- provide custom spider specific settings. Should define a map with custom
+ provide custom spider specific settings. Should define a list with custom
settings and their values. These values will take precedence over the
global settings defined in the config.
"""
@@ -22,7 +22,7 @@ defmodule Crawly.Spider do
@callback parse_item(response :: HTTPoison.Response.t()) ::
Crawly.ParsedItem.t()

- @callback settings_override() :: Crawly.Settings.t()
+ @callback override_settings() :: Crawly.Settings.t()

- @optional_callbacks settings_override: 0
+ @optional_callbacks override_settings: 0
end
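
For illustration, a hypothetical spider implementing the renamed optional callback could look like the sketch below (module name and URLs are invented; the `override_settings/0` body mirrors the test further down):

```elixir
defmodule ExampleSpider do
  @behaviour Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://example.com/index.html"]]

  @impl Crawly.Spider
  def parse_item(_response) do
    %{items: [], requests: []}
  end

  # Optional: values returned here take precedence over the global config.
  @impl Crawly.Spider
  def override_settings(), do: [concurrent_requests_per_domain: 5]
end
```
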
4 changes: 2 additions & 2 deletions lib/crawly/utils.ex
@@ -150,10 +150,10 @@ defmodule Crawly.Utils do
defp get_spider_setting(_setting_name, nil), do: nil

defp get_spider_setting(setting_name, spider_name) do
- case function_exported?(spider_name, :settings_override, 0) do
+ case function_exported?(spider_name, :override_settings, 0) do
true ->

- Map.get(spider_name.settings_override(), setting_name, nil)
+ Keyword.get(spider_name.override_settings(), setting_name, nil)

false ->
nil
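The `Map.get/3` to `Keyword.get/3` swap keeps the lookup contract intact: both return `nil` when the setting is absent, so callers of `get_spider_setting/2` see no behavioural change beyond the new container type. A quick illustration:

```elixir
# Same fallback semantics, different container type.
Map.get(%{concurrent_requests_per_domain: 5}, :closespider_timeout)
#=> nil
Keyword.get([concurrent_requests_per_domain: 5], :closespider_timeout)
#=> nil
```
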
4 changes: 1 addition & 3 deletions test/settings_test.exs
@@ -42,7 +42,5 @@ defmodule Elixir.TestSpiderSettingsOverride do
%{:items => [], :requests => []}
end

- def settings_override() do
-   %{concurrent_requests_per_domain: 5}
- end
+ def override_settings(), do: [concurrent_requests_per_domain: 5]
end
