# Requirements
1. Elixir "~> 1.7"
2. Works on Linux, Windows, OS X and BSD

# Installation

1. Generate a new Elixir project: `mix new <project_name> --sup`
2. Add Crawly to your `mix.exs` file:
```elixir
def deps do
  [{:crawly, "~> 0.5.0"}]
end
```
3. Fetch crawly: `mix deps.get`

# Quickstart

In this section we will show how to bootstrap a small project and set up
Crawly for data extraction.

1. Create a new Elixir project: `mix new crawly_example --sup`
2. Add Crawly to the dependencies (mix.exs file):
```elixir
defp deps do
  [
    {:crawly, "~> 0.5.0"}
  ]
end
```
3. Fetch dependencies: `mix deps.get`
4. Define the crawling rules (spider):
```elixir
cat > lib/crawly_example/esl_spider.ex << EOF
defmodule EslSpider do
  @behaviour Crawly.Spider
  alias Crawly.Utils

  @impl Crawly.Spider
  def base_url(), do: "https://www.erlang-solutions.com"

  @impl Crawly.Spider
  def init(), do: [start_urls: ["https://www.erlang-solutions.com/blog.html"]]

  @impl Crawly.Spider
  def parse_item(response) do
    hrefs = response.body |> Floki.find("a.more") |> Floki.attribute("href")

    requests =
      Utils.build_absolute_urls(hrefs, base_url())
      |> Utils.requests_from_urls()

    title = response.body |> Floki.find("article.blog_post h1") |> Floki.text()

    %{
      :requests => requests,
      :items => [%{title: title, url: response.request_url}]
    }
  end
end
EOF
```
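
The extraction itself is done with Floki: `Floki.find/2` selects nodes by CSS selector, `Floki.attribute/2` pulls out attribute values, and `Floki.text/1` returns the concatenated text. Below is a minimal sketch of how the two selectors above behave, on made-up markup and assuming a Floki version that, like the spider above, accepts an HTML binary directly:

```elixir
# Hypothetical markup for illustration only
html =
  ~s(<article class="blog_post"><h1>Elixir at scale</h1><a class="more" href="/blog/elixir-at-scale.html">Read more</a></article>)

# Links to follow, as used for `requests` above
html |> Floki.find("a.more") |> Floki.attribute("href")
#=> ["/blog/elixir-at-scale.html"]

# Page title, as used for the `:title` item field above
html |> Floki.find("article.blog_post h1") |> Floki.text()
#=> "Elixir at scale"
```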

5. Configure Crawly:
By default Crawly does not require any configuration, but you will need one to fine-tune the crawls:

```elixir
config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  follow_redirects: true,
  closespider_itemcount: 1000,
  output_format: "csv",
  item: [:title, :url],
  item_id: :title,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    Crawly.Pipelines.Validate,
    Crawly.Pipelines.DuplicatesFilter,
    Crawly.Pipelines.CSVEncoder
  ]
```
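
The snippet above belongs in the project's Mix config. Here is a minimal sketch of `config/config.exs` wrapping it, assuming the file generated by `mix new` on Elixir ~1.7 (which uses `Mix.Config`; newer Elixir versions use `import Config` instead) and repeating only a few of the settings shown above:

```elixir
# config/config.exs -- a sketch; the full set of options is listed above
use Mix.Config

config :crawly,
  closespider_timeout: 10,
  concurrent_requests_per_domain: 8,
  closespider_itemcount: 1000
```

Here `item` lists the fields a scraped item must contain to pass the `Validate` pipeline, and `item_id` names the field the `DuplicatesFilter` pipeline uses to drop duplicates.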


6. Start the crawl:
- `iex -S mix`
- `Crawly.Engine.start_spider(EslSpider)`
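
Put together, a crawl session in `iex` looks roughly like this (a sketch; `Crawly.Engine.stop_spider/1` is assumed to be available in the Crawly version in use):

```elixir
# Inside the `iex -S mix` shell started above
Crawly.Engine.start_spider(EslSpider)

# Later, stop the crawl manually if you do not want to wait for the
# closespider_* limits to kick in (stop_spider/1 is an assumption here)
Crawly.Engine.stop_spider(EslSpider)
```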

7. Results can be seen with: `cat /tmp/EslSpider.jl`
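
The items can also be peeked at from the same `iex` session; a small sketch assuming the `/tmp/EslSpider.jl` output path shown above:

```elixir
# Print the first three scraped items from the output file
"/tmp/EslSpider.jl"
|> File.stream!()
|> Enum.take(3)
|> Enum.each(&IO.puts/1)
```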


# Documentation
