diff --git a/README.md b/README.md
index 829651dc..fcc383ff 100644
--- a/README.md
+++ b/README.md
@@ -13,17 +13,96 @@ historical archival.
 1. Elixir "~> 1.7"
 2. Works on Linux, Windows, OS X and BSD
 
-# Install
+# Installation
 
-1. Generate an new Elixir project: `mix new --sup`
-2. Add Crawly to you mix.exs file
+1. Generate a new Elixir project: `mix new my_project --sup`
+2. Add Crawly to your mix.exs file
 ```elixir
 def deps do
-  [{:crawly, "~> 0.3.0"}]
+  [{:crawly, "~> 0.5.0"}]
 end
 ```
 3. Fetch crawly: `mix deps.get`
 
+# Quickstart
+
+In this section we will show how to bootstrap a small project and set up
+Crawly for data extraction.
+
+1. Create a new Elixir project: `mix new crawly_example --sup`
+2. Add Crawly and Floki (used by the spider below for HTML parsing) to the
+   dependencies (mix.exs file):
+```elixir
+defp deps do
+  [
+    {:crawly, "~> 0.5.0"},
+    {:floki, "~> 0.20.0"}
+  ]
+end
+```
+3. Fetch dependencies: `mix deps.get`
+4. Define the crawling rules (spider):
+```elixir
+cat > lib/crawly_example/esl_spider.ex << EOF
+defmodule EslSpider do
+  @behaviour Crawly.Spider
+  alias Crawly.Utils
+
+  @impl Crawly.Spider
+  def base_url(), do: "https://www.erlang-solutions.com"
+
+  @impl Crawly.Spider
+  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]
+
+  @impl Crawly.Spider
+  def parse_item(response) do
+    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")
+
+    requests =
+      Utils.build_absolute_urls(hrefs, base_url())
+      |> Utils.requests_from_urls()
+
+    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()
+
+    %{
+      :requests => requests,
+      :items => [%{title: title, url: response.request_url}]
+    }
+  end
+end
+EOF
+```
+
+5. Configure Crawly:
+By default Crawly does not require any configuration, but you will need one
+to fine-tune the crawl. Add the following to `config/config.exs`:
+
+```elixir
+config :crawly,
+  closespider_timeout: 10,
+  concurrent_requests_per_domain: 8,
+  follow_redirects: true,
+  closespider_itemcount: 1000,
+  output_format: "csv",
+  item: [:title, :url],
+  item_id: :title,
+  middlewares: [
+    Crawly.Middlewares.DomainFilter,
+    Crawly.Middlewares.UniqueRequest,
+    Crawly.Middlewares.UserAgent
+  ],
+  pipelines: [
+    Crawly.Pipelines.Validate,
+    Crawly.Pipelines.DuplicatesFilter,
+    Crawly.Pipelines.CSVEncoder
+  ]
+```
+
+6. Start the crawl:
+   - `iex -S mix`
+   - `Crawly.Engine.start_spider(EslSpider)`
+
+7. Results can be seen with: `cat /tmp/EslSpider.jl`
+
 # Documentation
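A note on the Quickstart above: the selectors in step 4's `parse_item/1` are easier to follow with a concrete input. Here is a small sketch of what the Floki calls return, run against a hypothetical HTML snippet shaped like the ESL blog markup (the markup and post title are made up for illustration):

```elixir
# Hypothetical markup matching the "a.more" and "article.blog_post h1"
# selectors used by EslSpider.parse_item/1.
html = """
<article class="blog_post"><h1>Example post title</h1></article>
<a class="more" href="/blog/example-post.html">Read more</a>
"""

html |> Floki.find("a.more") |> Floki.attribute("href")
#=> ["/blog/example-post.html"]

html |> Floki.find("article.blog_post h1") |> Floki.text()
#=> "Example post title"
```

`Utils.build_absolute_urls/2` then joins relative hrefs like these with `base_url()`, and `Utils.requests_from_urls/1` wraps the resulting URLs into Crawly requests for the scheduler.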
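Step 6 drives the spider interactively from IEx. If you would rather script the whole run, a minimal sketch could look like the following (the module name is illustrative, and `Crawly.Engine.stop_spider/1` is assumed to be available alongside `start_spider/1`):

```elixir
defmodule CrawlyExample.Run do
  # Sketch: start EslSpider, give it time to work, stop it, print the output.
  def run do
    Crawly.Engine.start_spider(EslSpider)

    # The closespider_timeout and closespider_itemcount settings above will
    # also stop the spider on their own; this sleep is just an upper bound.
    Process.sleep(60_000)

    Crawly.Engine.stop_spider(EslSpider)

    # Output path from step 7 of the Quickstart.
    "/tmp/EslSpider.jl" |> File.read!() |> IO.puts()
  end
end
```

Run it with `mix run -e "CrawlyExample.Run.run()"`.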