Re-add quickstart to the main documentation page on GH
I decided to keep the quickstart always up to date and always
available on the GitHub homepage of the project.

The idea is to show how to bootstrap it and get going even before
reading the docs.
oltarasenko committed Dec 23, 2019
1 parent 838049e commit b169713
Showing 1 changed file with 72 additions and 8 deletions: README.md
@@ -15,15 +15,79 @@ historical archival.
1. Elixir "~> 1.7"
2. Works on Linux, Windows, OS X and BSD

## Installation

Add Crawly to your mix.exs file:

```elixir
def deps do
  [{:crawly, "~> 0.6.0"}]
end
```
## Quickstart

1. Add Crawly as a dependency:
```elixir
# mix.exs
defp deps do
  [
    {:crawly, "~> 0.6.0"}
  ]
end
```
2. Fetch dependencies: `$ mix deps.get`
3. Create a spider (a quick callback check follows the listing):

```elixir
# lib/crawly_example/esl_spider.ex
defmodule EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Follow every "read more" link found on the page.
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()

    # Return the new requests to schedule and the items scraped from this page.
    %{
      requests: requests,
      items: [%{title: title, url: response.request_url}]
    }
  end
end
```
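
A quick way to sanity-check the spider is to call its callbacks directly from `iex -S mix`. The session below is only a sketch: the HTML fragment is made up, and it assumes Crawly hands `parse_item/1` an `HTTPoison.Response` struct (the HTTP client used at this version).

```elixir
# Hypothetical iex session; the HTML and the response struct are stand-ins.
iex> html = ~s(<a class="more" href="/blog/some-post.html">Read more</a>
...>           <article class="blog_post"><h1>Some post</h1></article>)
iex> response = %HTTPoison.Response{
...>   body: html,
...>   request_url: "https://www.erlang-solutions.com/blog.html"
...> }
iex> %{requests: requests, items: items} = EslSpider.parse_item(response)
iex> length(requests)
1
iex> items
[%{title: "Some post", url: "https://www.erlang-solutions.com/blog.html"}]
```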

4. Configure Crawly
- By default, Crawly does not require any configuration, but you will want one to fine-tune your crawls (an environment-specific override is sketched after the block):
```elixir
# in config.exs
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.CSVEncoder,
    Crawly.Pipelines.WriteToFile
  ]
```
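
Mix merges the `:crawly` keyword list per key across config files, so environment-specific overrides work the usual way. A minimal sketch, assuming your `config.exs` ends with the standard `import_config "#{Mix.env()}.exs"` line:

```elixir
# config/dev.exs (hypothetical): keep development crawls small and fast.
use Mix.Config

config :crawly,
  closespider_timeout: 5,
  concurrent_requests_per_domain: 2,
  closespider_itemcount: 100
```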
5. Start the crawl:
- `$ iex -S mix`
- `iex(1)> Crawly.Engine.start_spider(EslSpider)`
6. Results can be seen with: `$ cat /tmp/EslSpider.csv`
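
If you would rather inspect the results from iex than the shell, a minimal sketch (assuming the default `/tmp/EslSpider.csv` path from the step above):

```elixir
# Print the first few scraped rows from the CSV produced by the crawl.
"/tmp/EslSpider.csv"
|> File.stream!()
|> Enum.take(5)
|> Enum.each(&IO.write/1)
```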

## Documentation
