# Requirements
1. Elixir "~> 1.7"
2. Works on Linux, Windows, OS X and BSD

# Installation

1. Generate a new Elixir project: `mix new <project_name> --sup`
2. Add Crawly to your `mix.exs` file:
```elixir
def deps do
  [{:crawly, "~> 0.5.0"}]
end
```
3. Fetch crawly: `mix deps.get`

# Quickstart

In this section we will show how to bootstrap a small project and set up
Crawly for data extraction.

1. Create a new Elixir project: `mix new crawly_example --sup`
2. Add Crawly to the dependencies (mix.exs file):
```elixir
defp deps do
  [
    {:crawly, "~> 0.5.0"}
  ]
end
```
3. Fetch dependencies: `mix deps.get`
4. Define the crawling rules (spider):
```elixir
cat > lib/crawly_example/esl_spider.ex << EOF
defmodule EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end
EOF
```
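
The extraction itself is done with Floki: `Floki.find/2` selects nodes by CSS selector, `Floki.attribute/2` pulls out attribute values, and `Floki.text/1` returns the concatenated text. Below is a minimal sketch of how the two selectors above behave, on made-up markup and assuming a Floki version that, like the spider above, accepts an HTML binary directly:

```elixir
# Hypothetical markup for illustration only
html =
  ~s(<article class="blog_post"><h1>Elixir at scale</h1><a class="more" href="/blog/elixir-at-scale.html">Read more</a></article>)

# Links to follow, as used for `requests` above
html |> Floki.find("a.more") |> Floki.attribute("href")
#=> ["/blog/elixir-at-scale.html"]

# Page title, as used for the `:title` item field above
html |> Floki.find("article.blog_post h1") |> Floki.text()
#=> "Elixir at scale"
```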

5. Configure Crawly:
By default Crawly does not require any configuration, but you will need one to fine-tune the crawls:

```elixir
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.CSVEncoder
  ]
```
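
The snippet above belongs in the project's Mix config. Here is a minimal sketch of `config/config.exs` wrapping it, assuming the file generated by `mix new` on Elixir ~1.7 (which uses `Mix.Config`; newer Elixir versions use `import Config` instead) and repeating only a few of the settings shown above:

```elixir
# config/config.exs -- a sketch; the full set of options is listed above
use Mix.Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 1000
```

Here `item` lists the fields a scraped item must contain to pass the `Validate` pipeline, and `item_id` names the field the `DuplicatesFilter` pipeline uses to drop duplicates.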


6. Start the crawl:
- `iex -S mix`
- `Crawly.Engine.start_spider(EslSpider)`
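
Put together, a crawl session in `iex` looks roughly like this (a sketch; `Crawly.Engine.stop_spider/1` is assumed to be available in the Crawly version in use):

```elixir
# Inside the `iex -S mix` shell started above
Crawly.Engine.start_spider(EslSpider)

# Later, stop the crawl manually if you do not want to wait for the
# closespider_* limits to kick in (stop_spider/1 is an assumption here)
Crawly.Engine.stop_spider(EslSpider)
```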

7. Results can be seen with: `cat /tmp/EslSpider.jl`
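
The items can also be peeked at from the same `iex` session; a small sketch assuming the `/tmp/EslSpider.jl` output path shown above:

```elixir
# Print the first three scraped items from the output file
"/tmp/EslSpider.jl"
|> File.stream!()
|> Enum.take(3)
|> Enum.each(&IO.puts/1)
```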


# Documentation
